Preface



These are the proceedings of CHES 2001, the third Workshop on Cryptographic 
Hardware and Embedded Systems. The first two CHES Workshops were held in 
Massachusetts, and this was the first Workshop to be held in Europe. There was 
a large number of submissions this year, and in response the technical program 
was extended to 2 1/2 days. 

As is evident from the papers in these proceedings, many excellent submissions
were made. Selecting the papers for this year’s CHES was not an easy task, and
we regret that we had to reject several very interesting papers due to the lack of
time. There were 66 submitted contributions this year, of which 31, or 47%, were
selected for presentation. If we look at the number of submitted papers at CHES
’99 (42 papers) and CHES 2000 (51 papers), we observe a steady increase. We
interpret this as a continuing need for a workshop series which combines theory 
and practice for integrating strong security features into modern communications 
and computer applications. In addition to the submitted contributions, Ross 
Anderson from Cambridge University, UK, and Adi Shamir from The Weizmann 
Institute, Israel, gave invited talks. 

As in previous years, the focus of the workshop is on all aspects of crypto- 
graphic hardware and embedded system design. Of special interest were contri- 
butions that describe new methods for efficient hardware implementations and 
high-speed software for embedded systems, e.g., smart cards, microprocessors, 
DSPs, etc. CHES also continues to be an important forum for new theoretical 
and practical findings in the important and growing field of side-channel attacks. 

We hope to continue to make the CHES workshop series a forum of intel- 
lectual exchange in creating secure, reliable, and robust security solutions of 
tomorrow. CHES Workshops will continue to deal with hardware and software 
implementations of security functions and systems, including security for em- 
bedded wireless ad-hoc networks. 

We thank everyone whose involvement made the CHES Workshop such a
successful event. In particular, we would like to thank Andre Weimerskirch from
WPI, and Delphine Abecassis and Cecile Osta from Novamedia for their efforts.
Protecting Embedded Systems
The Next Ten Years 



Ross Anderson 

Computer Laboratory, 
Pembroke Street, Cambridge, England 
Ross.Anderson@cl.cam.ac.uk



Abstract. In this talk, I will speculate about the likely near-term and 
medium-term scientific developments in the protection of embedded sys- 
tems. 

A common view of the Internet divides its history into three waves, the 
first being centered around mainframes and terminals, and the second 
(from about 1992 until now) on PCs, browsers, and a GUI. The third 
wave, starting now, will see the connection of all sorts of devices that are 
currently in proprietary networks, standalone, or even non-computerized. 
By the end of 2003, there might well be more mobile phones connected 
to the Internet than computers. Within a few years we will see many of 
the world’s fridges, heart monitors, bus ticket dispensers, burglar alarms, 
and electricity meters talking IP. By 2010, ‘ubiquitous computing’ will 
be part of our lives. 

Some of the likely effects of ubiquitous computing are already apparent. 
For example, applications with intermittent connectivity will have to 
maintain much of their security state locally rather than globally. This 
will create new markets for processors with appropriate levels of tamper- 
resistance. But what will this mean? 

I will discuss protection requirements at four levels. 

Invasive attacks on hardware are likely to remain possible for capa- 
ble motivated opponents, at least for devices that cannot be fur- 
nished with effective tamper responding barriers. That said, even 
commodity smartcards are much harder to probe than was the case 
five years ago. Decreasing feature sizes, 32-bit processors, and lay- 
out that makes bus lines harder to find and to probe, all combine to 
push up the entry cost. Attacks that could be done in a few weeks 
with ten thousand dollars’ worth of equipment now take months and 
require access to equipment costing several hundred thousand dol- 
lars. However, this field rides on the coat-tails of the semiconductor 
test industry, and will remain unpredictable. Every so often, bright 
ideas lead to powerful new low-cost testing tools, that may be used 
in attacks. The scanning capacitance microscope may be one such.

Non-invasive attacks on hardware, such as power and glitch attacks,
might become infeasible against even the smallest processors.
However, this is not as easy as it seemed three or four years
ago. Current techniques, such as randomised clocking, can only do
so much. New ideas are needed, and I will discuss an EU-funded
research project (G3Card) to develop these. Its goal is to produce a
prototype smartcard CPU that is inherently resistant to non-invasive
attacks. The prototypes currently being designed at Cambridge
under G3Card use asynchronous (self-timed) dual-rail logic, which
holds out the prospect of power consumption that is independent of
the data being processed. This technology also promises
important side benefits, such as reduced RFI/EMI and lower
power consumption.

Protocol-level attacks continue to be a terrible problem. The design 
of ordinary authentication protocols is well known to be hard; yet a 
typical cryptographic processor performs much more than one pro- 
tocol. Its API may have to support somewhere between a few dozen 
and a few hundred different cryptographic transactions. The paper in 
these proceedings by Mike Bond shows that attacks can be found on 
even the most mature and thoroughly-studied cryptographic APIs. 
Developing the tools and concepts to design robust cryptographic 
APIs looks set to be a major research challenge for some years to 
come, and may be the next big topic for the protocol research com- 
munity. 

Business process failures are coming to be recognised as perhaps the 
main cause of attacks on real systems. Once the principal providing 
the protection is no longer the same as the principal who will suf- 
fer loss if it fails, things become messy. While a traditional mono- 
lithic pay-TV operator might have owned the smartcard designer, 
the satellite transponder, the set-top boxes and indeed the entire 
customer base, things are now becoming much more fragmented. 
Design, evaluation, implementation and operations are being ever 
more widely distributed, and this is starting to introduce serious 
evaluation and assurance issues. There are also economic issues such 
as network externalities, asymmetric information, moral hazard, adverse
selection, liability dumping and the tragedy of the commons.

The above themes interact in unexpected ways. For example, even a completely
tamper-proof chip can have its design read out by a litigation attack;
the attacker buys a vaguely relevant patent, brings a lawsuit against
the device designer for infringement, and obtains full design details as 
part of the legal discovery process. This may be a further argument in 
favour of Kerckhoffs’ principle. On the other hand, a highly obscure de- 
sign can greatly complicate matters for an attacker whose tools allow 
him to observe only partial information about the computations being 
undertaken. 

Ultimately, though, information security is about power. At the
technical level it is about controlling who may use which resource and
how, while at the level of business strategy it is increasingly about raising
barriers to trade, segmenting markets and differentiating products. A 
final point is that sometimes insecurity is welcome. For example, it may 
foster economic growth by making monopolies harder to defend. 
Abstract. Since the announcement of the Differential Power Analysis
(DPA) by Paul Kocher et al., several countermeasures were proposed in
order to protect software implementations of cryptographic algorithms. 
In an attempt to reduce the resulting memory and execution time over- 
head, a general method was recently proposed, consisting in “masking” 
all the intermediate data. 

This masking strategy is possible if all the fundamental operations used 
in a given algorithm can be rewritten with masked input data, giving 
masked output data. This is easily seen to be the case in classical algo- 
rithms such as DES or RSA. 

However, for algorithms that combine boolean and arithmetic functions, 
such as IDEA or several of the AES candidates, two different kinds of 
masking have to be used. There is thus a need for a method to convert 
back and forth between boolean masking and arithmetic masking. 

A first solution to this problem was proposed by Thomas Messerges in 
[15], but was unfortunately shown (see [6]) insufficient to prevent DPA. 
In the present paper, we present two new practical algorithms for the 
conversion, that are proven secure against DPA. 

The first one (“BooleanToArithmetic”) uses a constant number of
elementary operations, namely 7, on the registers of the processor. The
number of elementary operations for the second one (“ArithmeticToBoolean”),
namely 5K + 5, is proportional to the size K (in bits) of the
processor registers.

Key words: Physical attacks, Differential Power Analysis, Electric
consumption, AES, IDEA, Smartcards, Masking Techniques.



1 Introduction 

Paul Kocher et al. introduced in 1998 ([12]) and published in 1999 ([13]) the
concept of the Differential Power Analysis attack, also known as DPA. The initial
focus was on symmetric cryptosystems such as DES (see [12,16]) and the AES
candidates (see [1,3,7]), but public key cryptosystems have since been shown to
be vulnerable to DPA attacks as well (see [17,5,11]).

In [10,11], Goubin and Patarin proposed a generic countermeasure consisting
in splitting all the intermediate variables. A similar “duplication” method
was suggested shortly after by Chari et al. in [3] and [4]. Although the authors
of [3] state that these general methods generally increase dramatically
the amount of memory needed, or the computation time, Goubin and Patarin
proved that realistic implementations could be reached with the “duplication”
method. However, it has been shown in [9] that even inner rounds can be targeted
by “Power-Analysis”-type attacks, so that the splitting should be performed on
all rounds of the algorithm. This makes the issue of the memory and computation
time overhead even more crucial, especially for embedded systems such as
smartcards.

In [15], Thomas Messerges investigated DPA attacks applied to the AES
candidates. He developed a general countermeasure, consisting in masking all
the inputs and outputs of each elementary operation used by the microprocessor.
This generic technique allowed him to evaluate the impact of these countermeasures
on the five AES algorithms.

However, for algorithms that combine boolean and arithmetic functions, two 
different kinds of masking have to be used. There is thus a need for a method to 
convert back and forth between boolean masking and arithmetic masking. This 
is typically the case for IDEA [14] and for three AES candidates: MARS [2], 
RC6 [18] and Twofish [19]. 

T. Messerges proposed in [15] an algorithm to perform this conversion
between a “$\oplus$ mask” and a “$+$ mask”. Unfortunately, Coron and Goubin
described in [6] a specific attack, showing that the “BooleanToArithmetic”
algorithm proposed by T. Messerges is not sufficient to prevent Differential Power
Analysis. In a similar way, his “ArithmeticToBoolean” algorithm is not secure
either.

In the present paper, we present two new “BooleanToArithmetic” and
“ArithmeticToBoolean” algorithms, proven secure against DPA attacks. Each of these
algorithms uses only very simple operations: “XOR”, “AND”, subtractions and
“logical shift left”. Our “BooleanToArithmetic” algorithm uses a constant number
(namely 7) of such elementary operations, whereas the number of elementary
operations involved in our “ArithmeticToBoolean” algorithm is proportional
(namely equal to 5K + 5) to the size (i.e. the number K of bits) of the processor
registers.



2 Background 

2.1 The “Differential Power Analysis” Attack 

The “Differential Power Analysis” (DPA) is an attack that allows an attacker to obtain
information about the secret key (contained in a smartcard for example), by
performing a statistical analysis of the electric consumption records measured
for a large number of computations with the same key.

This attack does not require any knowledge about the individual electric
consumption of each instruction, nor about the position in time of each of these
instructions. It applies in exactly the same way as soon as the attacker knows the
outputs of the algorithm and the corresponding consumption curves. It only
relies on the following fundamental hypothesis:

Fundamental hypothesis: There exists an intermediate variable, that appears
during the computation of the algorithm, such that knowing a few key bits
(in practice less than 32 bits) allows us to decide whether or not two inputs
(respectively two outputs) give the same value for this variable.

2.2 The Masking Method 

In the present paper, we focus on the “masking method”, initially suggested by 
Goubin and Patarin in [10], and studied further in [11]. 

The basic principle consists in programming the algorithm so that the fundamental
hypothesis above is not true any longer (i.e. an intermediate variable
never depends on the knowledge of an easily accessible subset of the secret key).
More precisely, using a secret sharing scheme, each intermediate variable that
appears in the cryptographic algorithm is split. Therefore, an attacker has to
analyze multiple point distributions, which makes his task grow exponentially
in the number of elements in the splitting.

2.3 The Conversion Problem 

For algorithms that combine boolean and arithmetic functions, two different
kinds of masking have to be used:

Boolean masking: $x' = x \oplus r$

Arithmetic masking: $A = x - r \bmod 2^K$

Here the variable $x$ is masked with random $r$ to give the masked value $x'$ (or
$A$). Our goal is to find an efficient algorithm for converting from boolean masking
to arithmetic masking and conversely, in which all intermediate variables are
decorrelated from the data to be masked, so that it is secure against DPA.

Throughout the present paper, we suppose that the processor has $K$-bit registers
(in practice, $K$ is most of the time equal to 8, 16, 32 or 64). All the arithmetic
operations (such as the addition “$+$”, the subtraction “$-$”, or the doubling
“$z \mapsto 2z$”) are considered modulo $2^K$. For simplicity, the “$\bmod\ 2^K$” will often
be omitted in the sequel.
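The two kinds of masking can be illustrated with a short Python sketch. The register width `K = 8` and the function names are assumptions chosen for illustration; they do not come from the paper.

```python
import secrets

K = 8                    # assumed register width in bits
MASK = (1 << K) - 1      # reduction modulo 2**K


def boolean_mask(x, r):
    """Boolean masking: x' = x XOR r."""
    return (x ^ r) & MASK


def arithmetic_mask(x, r):
    """Arithmetic masking: A = x - r mod 2**K."""
    return (x - r) & MASK


def boolean_unmask(x_prime, r):
    """Recover x from x' = x XOR r."""
    return (x_prime ^ r) & MASK


def arithmetic_unmask(a, r):
    """Recover x from A = x - r mod 2**K."""
    return (a + r) & MASK


# Round trip with a fresh random mask.
x = 0xA7
r = secrets.randbelow(1 << K)
assert boolean_unmask(boolean_mask(x, r), r) == x
assert arithmetic_unmask(arithmetic_mask(x, r), r) == x
```

The conversion problem is to go from `boolean_mask(x, r)` to `arithmetic_mask(x, r)` (and back) without ever holding `x` itself in a variable.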

3 From Boolean to Arithmetic Masking

3.1 A Useful Algebraic Property 

Let $I = \{0, 1, 2, \ldots, 2^K - 1\}$, with $K \geq 1$ being an integer. Let $x' \in I$. We
consider the function $\Phi_{x'} : I \to I$, defined by:

$$\Phi_{x'}(r) = (x' \oplus r) - r \bmod 2^K.$$

We identify each element of $I$ with the sequence of coefficients in its binary
representation, so that $I$ can be viewed as a vector space of dimension $K$ over
GF(2), isomorphic to $GF(2)^K$.
Theorem 1

$$\Phi_{x'}(r) = x' \oplus \bigoplus_{i=1}^{K-1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j\, \overline{x'} \Big) \wedge (2^i x') \wedge (2^i r) \Big],$$

where $\overline{x'}$ stands for the ones complement of $x'$, and $\wedge$ stands for the boolean
“AND” operator.

See Appendix 1 for a proof of Theorem 1.

Corollary 1.1 The function $\Phi_{x'}$ is affine over GF(2).

This result is an easy consequence of Theorem 1.
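As a sanity check rather than a proof, the closed formula and the affine property can be verified exhaustively for a small register size. The sketch below assumes $K = 4$; the helper names are hypothetical.

```python
K = 4                    # small assumed register width, so the check is exhaustive
MASK = (1 << K) - 1


def phi(x_prime, r):
    """Phi_{x'}(r) = (x' XOR r) - r mod 2**K, computed directly."""
    return ((x_prime ^ r) - r) & MASK


def phi_closed_form(x_prime, r):
    """Right-hand side of Theorem 1, using only XOR, AND and shifts."""
    not_x = ~x_prime & MASK
    acc = x_prime
    for i in range(1, K):
        term = ((x_prime << i) & (r << i)) & MASK
        for j in range(1, i):          # empty when i == 1 (empty product)
            term &= (not_x << j) & MASK
        acc ^= term
    return acc


def is_affine(x_prime):
    """Check that r -> Phi_{x'}(r) XOR Phi_{x'}(0) is additive over GF(2)."""
    c = phi(x_prime, 0)
    size = 1 << K
    return all(
        (phi(x_prime, r1 ^ r2) ^ c) == ((phi(x_prime, r1) ^ c) ^ (phi(x_prime, r2) ^ c))
        for r1 in range(size)
        for r2 in range(size)
    )
```

Both checks pass for every `x_prime` in this small setting, which is consistent with Theorem 1 and Corollary 1.1.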

3.2 The “BooleanToArithmetic” Algorithm 

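Since $\Phi_{x'}$ is affine over GF(2), the function $\Psi_{x'} = \Phi_{x'} \oplus \Phi_{x'}(0)$ is linear over
GF(2). Therefore, for any value $\gamma$,

$$\Psi_{x'}(r) = \Psi_{x'}(\gamma \oplus (r \oplus \gamma)) = \Psi_{x'}(\gamma) \oplus \Psi_{x'}(r \oplus \gamma).$$

Corollary 1.2 For any value $\gamma$, if we denote $A = (x' \oplus r) - r$, we also have

$$A = [(x' \oplus \gamma) - \gamma] \oplus x' \oplus [(x' \oplus (r \oplus \gamma)) - (r \oplus \gamma)].$$

$A = (x' \oplus r) - r$ can thus be obtained from the following algorithm: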



Algorithm 1. BooleanToArithmetic
Require: $(x', r)$ such that $x = x' \oplus r$
Ensure: $(A, r)$ such that $x = A + r$
Initialize $\Gamma$ to a random value $\gamma$
$T \leftarrow x' \oplus \Gamma$
$T \leftarrow T - \Gamma$
$T \leftarrow T \oplus x'$
$\Gamma \leftarrow \Gamma \oplus r$
$A \leftarrow x' \oplus \Gamma$
$A \leftarrow A - \Gamma$
$A \leftarrow A \oplus T$



The “BooleanToArithmetic” algorithm uses 2 auxiliary variables ($T$ and $\Gamma$),
1 random generation and 7 elementary operations (more precisely: 5 “XOR” and
2 subtractions).
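A direct Python transcription of Algorithm 1 may help when checking the operation count. This is a minimal sketch: the function name, the default width `k=8`, the optional `gamma` argument (useful for testing) and the use of `secrets.randbelow` as the random source are assumptions, and Python integers give none of the side-channel guarantees discussed in the paper.

```python
import secrets


def boolean_to_arithmetic(x_prime, r, k=8, gamma=None):
    """Convert x' = x XOR r into A = x - r mod 2**k, following Algorithm 1.

    `gamma` may be supplied for reproducible tests; otherwise it is drawn
    uniformly at random, as the algorithm requires.
    """
    mask = (1 << k) - 1
    if gamma is None:
        gamma = secrets.randbelow(1 << k)
    g = gamma                  # Gamma <- gamma
    t = x_prime ^ g            # T <- x' XOR Gamma
    t = (t - g) & mask         # T <- T - Gamma
    t = t ^ x_prime            # T <- T XOR x'
    g = g ^ r                  # Gamma <- Gamma XOR r
    a = x_prime ^ g            # A <- x' XOR Gamma
    a = (a - g) & mask         # A <- A - Gamma
    a = a ^ t                  # A <- A XOR T
    return a
```

For example, with `k=8`, `x = 0xA7` and `r = 0x3C`, calling `boolean_to_arithmetic(x ^ r, r)` returns a value `a` with `(a + r) & 0xFF == x`, whatever `gamma` was drawn.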
3.3 Proof of Security against DPA

From the description of the “BooleanToArithmetic” algorithm, we easily obtain
the list of all the intermediate values $V_0, \ldots, V_6$ that appear during the
computation of $A = (x' \oplus r) - r$:

$$\begin{cases}
V_0 = \gamma \\
V_1 = \gamma \oplus r \\
V_2 = x' \oplus \gamma \\
V_3 = (x' \oplus \gamma) - \gamma \\
V_4 = [(x' \oplus \gamma) - \gamma] \oplus x' \\
V_5 = x' \oplus \gamma \oplus r \\
V_6 = (x' \oplus \gamma \oplus r) - (\gamma \oplus r)
\end{cases}$$

If we suppose that $\gamma$ is randomly chosen with a uniform distribution on
$I = \{0, 1\}^K$, it is easy to see that:

– the values $V_0$, $V_1$, $V_2$ and $V_5$ are uniformly distributed on $I$.

– the distributions of $V_3$, $V_4$ and $V_6$ depend on $x'$ but not on $r$.
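These two claims can be checked exhaustively for a small register size by enumerating every $\gamma$. The sketch below assumes $K = 3$ and treats $x'$ and $r$ as free parameters; the function names are hypothetical.

```python
from collections import Counter

K = 3                    # small assumed register width
N = 1 << K
MASK = N - 1


def intermediates(x_prime, r, gamma):
    """Return (V0, ..., V6) for one run of BooleanToArithmetic."""
    v0 = gamma
    v1 = gamma ^ r
    v2 = x_prime ^ gamma
    v3 = (v2 - gamma) & MASK
    v4 = v3 ^ x_prime
    v5 = x_prime ^ gamma ^ r
    v6 = (v5 - v1) & MASK
    return (v0, v1, v2, v3, v4, v5, v6)


def distribution(x_prime, r, index):
    """Distribution of V_index when gamma ranges uniformly over I."""
    return Counter(intermediates(x_prime, r, g)[index] for g in range(N))


def check_security():
    """True if V0, V1, V2, V5 are uniform and V3, V4, V6 do not depend on r."""
    uniform = Counter(range(N))
    for x_prime in range(N):
        for r in range(N):
            for idx in (0, 1, 2, 5):
                if distribution(x_prime, r, idx) != uniform:
                    return False
            for idx in (3, 4, 6):
                if distribution(x_prime, r, idx) != distribution(x_prime, 0, idx):
                    return False
    return True
```

An exhaustive check of this kind only confirms the statement for the chosen width; the general argument is the one given in the text.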

4 From Arithmetic to Boolean Masking

4.1 A Useful Recursion Formula 

Theorem 2 If we denote $x' = (A + r) \oplus r$, we also have $x' = A \oplus u_{K-1}$, where
$u_{K-1}$ is obtained from the following recursion formula:

$$\begin{cases}
u_0 = 0 \\
\forall k \geq 0,\ u_{k+1} = 2\,[u_k \wedge (A \oplus r) \oplus (A \wedge r)].
\end{cases}$$

See Appendix 2 for a proof of Theorem 2.
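The recursion can be checked numerically with a few lines of Python. This sketch evaluates $u_{K-1}$ directly, without any random mask, so it only illustrates the formula and is not itself a protected conversion; the function names are assumptions.

```python
def carry_recursion(a, r, k):
    """Return u_{k-1} from u_0 = 0, u_{j+1} = 2[u_j AND (A XOR r) XOR (A AND r)]."""
    mask = (1 << k) - 1
    u = 0
    for _ in range(k - 1):
        u = (2 * ((u & (a ^ r)) ^ (a & r))) & mask
    return u


def arithmetic_to_boolean_unmasked(a, r, k):
    """x' = A XOR u_{k-1}, which Theorem 2 states equals (A + r) XOR r."""
    return a ^ carry_recursion(a, r, k)
```

The quantity `u` plays the role of the carry vector of the addition `A + r`, which is why `k - 1` iterations are enough for a `k`-bit register.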

4.2 The “ArithmeticToBoolean” Algorithm 

Let $\gamma$ be any value. The change of variable $t_k = 2\gamma \oplus u_k$ leads to the following
consequence of Theorem 2.

Corollary 2.1 For any value $\gamma$, if we denote $x' = (A + r) \oplus r$, we also have
$x' = A \oplus 2\gamma \oplus t_{K-1}$, where $t_{K-1}$ is obtained from the following recursion formula:

$$\begin{cases}
t_0 = 2\gamma \\
\forall k \geq 0,\ t_{k+1} = 2\,[t_k \wedge (A \oplus r) \oplus \omega],
\end{cases}$$

in which $\omega = \gamma \oplus (2\gamma) \wedge (A \oplus r) \oplus A \wedge r$.

As a consequence, $x' = (A + r) \oplus r$ can be obtained from the
“ArithmeticToBoolean” algorithm below.

This method requires 3 auxiliary variables ($T$, $\Omega$ and $\Gamma$), 1 random generation
and $(5K + 5)$ elementary operations (more precisely: $(2K + 4)$ “XOR”, $(2K + 1)$
“AND” and $K$ “logical shift left”).
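Algorithm 2. ArithmeticToBoolean
Require: $(A, r)$ such that $x = A + r$
Ensure: $(x', r)$ such that $x = x' \oplus r$
Initialize $\Gamma$ to a random value $\gamma$
$T \leftarrow 2\Gamma$
$x' \leftarrow \Gamma \oplus r$
$\Omega \leftarrow \Gamma \wedge x'$
$x' \leftarrow T \oplus A$
$\Gamma \leftarrow \Gamma \oplus x'$
$\Gamma \leftarrow \Gamma \wedge r$
$\Omega \leftarrow \Omega \oplus \Gamma$
$\Gamma \leftarrow T \wedge A$
$\Omega \leftarrow \Omega \oplus \Gamma$
for $k = 1$ to $K - 1$ do
    $\Gamma \leftarrow T \wedge r$
    $\Gamma \leftarrow \Gamma \oplus \Omega$
    $T \leftarrow T \wedge A$
    $\Gamma \leftarrow \Gamma \oplus T$
    $T \leftarrow 2\Gamma$
end for
$x' \leftarrow x' \oplus T$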
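Algorithm 2 can be transcribed into Python in the same way. As before, this is a sketch under assumptions: the function name, the default width `k=8`, the optional `gamma` argument and the use of `secrets.randbelow` are illustrative choices, and the code makes no side-channel claims of its own.

```python
import secrets


def arithmetic_to_boolean(a, r, k=8, gamma=None):
    """Convert A = x - r mod 2**k into x' = x XOR r, following Algorithm 2.

    `gamma` may be supplied for reproducible tests; otherwise it is drawn
    uniformly at random.
    """
    mask = (1 << k) - 1
    if gamma is None:
        gamma = secrets.randbelow(1 << k)
    g = gamma                  # Gamma <- gamma
    t = (2 * g) & mask         # T <- 2 Gamma
    x_prime = g ^ r            # x' <- Gamma XOR r
    omega = g & x_prime        # Omega <- Gamma AND x'
    x_prime = t ^ a            # x' <- T XOR A
    g = g ^ x_prime            # Gamma <- Gamma XOR x'
    g = g & r                  # Gamma <- Gamma AND r
    omega = omega ^ g          # Omega <- Omega XOR Gamma
    g = t & a                  # Gamma <- T AND A
    omega = omega ^ g          # Omega <- Omega XOR Gamma
    for _ in range(1, k):      # K - 1 iterations
        g = t & r              # Gamma <- T AND r
        g = g ^ omega          # Gamma <- Gamma XOR Omega
        t = t & a              # T <- T AND A
        g = g ^ t              # Gamma <- Gamma XOR T
        t = (2 * g) & mask     # T <- 2 Gamma
    x_prime = x_prime ^ t      # x' <- x' XOR T
    return x_prime
```

Counting the statements gives $(2K + 4)$ XOR, $(2K + 1)$ AND and $K$ doublings, which matches the total of $5K + 5$ stated in the text.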



4.3 Proof of Security against DPA 

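From the description of the “ArithmeticToBoolean” algorithm, we easily obtain
the list of all the intermediate values $W_0, \ldots, W_{5K+4}$ that appear during the
computation of $x' = (A + r) \oplus r$:

$$\begin{cases}
W_0 = \gamma \\
W_1 = 2\gamma \\
W_2 = \gamma \oplus r \\
W_3 = \gamma \oplus \gamma \wedge r \\
W_4 = 2\gamma \oplus A \\
W_5 = \gamma \oplus 2\gamma \oplus A \\
W_6 = (\gamma \oplus 2\gamma \oplus A) \wedge r \\
W_7 = \gamma \oplus (2\gamma) \wedge r \oplus A \wedge r \\
W_8 = (2\gamma) \wedge A \\
W_9 = \gamma \oplus (2\gamma) \wedge (A \oplus r) \oplus A \wedge r = \omega
\end{cases}$$

and, for $1 \leq k \leq K - 1$:

$$\begin{cases}
W_{5k+5} = (2\gamma \oplus u_{k-1}) \wedge r \\
W_{5k+6} = \gamma \oplus (2\gamma) \wedge A \oplus u_{k-1} \wedge r \oplus A \wedge r \\
W_{5k+7} = (2\gamma \oplus u_{k-1}) \wedge A \\
W_{5k+8} = \gamma \oplus u_{k-1} \wedge (A \oplus r) \oplus A \wedge r \\
W_{5k+9} = 2\gamma \oplus u_k
\end{cases}$$

If we suppose that $\gamma$ is randomly chosen with a uniform distribution on
$I = \{0, 1\}^K$, it is easy to see that:

– the values $W_0$, $W_2$ and $W_{5k+8}$ ($1 \leq k \leq K - 1$) are uniformly distributed on $I$.

– the values $W_1$ and $W_{5k+9}$ are uniformly distributed on the subset $\{0, 1\}^{K-1} \times \{0\}$ of $I$.

– the distributions of $W_3$ and $W_{5k+5}$ ($1 \leq k \leq K - 1$) depend on $r$ but not on $A$.

– the distributions of $W_4$, $W_8$ and $W_{5k+7}$ ($1 \leq k \leq K - 1$) depend on $A$ but
not on $r$.

To study the distribution of the remaining values ($W_5$, $W_6$, $W_7$, $W_9$ and
$W_{5k+6}$), we will make use of the following result:

Theorem 3 For any $\delta \in I$, the following function is bijective:

$$\Theta_\delta : I \to I, \quad \gamma \mapsto \gamma \oplus (2\gamma) \wedge \delta.$$

See Appendix 3 for a proof of Theorem 3. As a result:

– the values $W_5 = \Theta_{-1}(\gamma) \oplus A$, $W_7 = \Theta_r(\gamma) \oplus A \wedge r$, $W_9 = \Theta_{A \oplus r}(\gamma) \oplus A \wedge r$ and
$W_{5k+6} = \Theta_A(\gamma) \oplus u_{k-1} \wedge r \oplus A \wedge r$ ($1 \leq k \leq K - 1$) are uniformly distributed
on $I$.

– the distribution of $W_6 = (\Theta_{-1}(\gamma) \oplus A) \wedge r$ depends on $r$ but not on $A$.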
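The bijectivity claimed by Theorem 3 is easy to confirm by enumeration for a small register size. The sketch below assumes nothing beyond the definition of $\Theta_\delta$; the function names are hypothetical.

```python
def theta(delta, gamma, k):
    """Theta_delta(gamma) = gamma XOR ((2 * gamma) AND delta), on k-bit values."""
    mask = (1 << k) - 1
    return gamma ^ (((2 * gamma) & mask) & delta)


def is_bijective(delta, k):
    """True if gamma -> Theta_delta(gamma) hits every k-bit value exactly once."""
    size = 1 << k
    return len({theta(delta, g, k) for g in range(size)}) == size
```

Because $\Theta_\delta$ is a bijection, XORing $\Theta_\delta(\gamma)$ with any value that does not depend on $\gamma$ yields a uniformly distributed result when $\gamma$ is uniform, which is the property used for $W_5$, $W_7$, $W_9$ and $W_{5k+6}$.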



5 Conclusion 

In this paper, we solved the following open problem (stated in [6]): “find an 
efficient algorithm for converting from boolean masking to arithmetic masking 
and conversely, in which all intermediate variables are decorrelated from the data 
to be masked, so that it is secure against DPA” . 

The construction of our “BooleanToArithmetic” and “ArithmeticToBoolean” 
algorithms also led us to prove some results of independent interest. In particular
we proved that $r \mapsto (a \oplus r) - r \bmod 2^K$ is an affine function, which seems to be
a new result.

Finally, a direction for further research would be to find an improved version
of the “ArithmeticToBoolean” algorithm, in which the number of elementary
operations is less than $5K + 5$, or (even better) a constant independent of the
size $K$ of the registers.
Annex 1: Proof of Theorem 1

To prove Theorem 1, we prove the following more precise result:

Lemma 1 For any integer $k \geq 1$:

$$\Phi_a(r) = \Big\{ a \oplus \bigoplus_{i=1}^{k-1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j\, \overline{a} \Big) \wedge (2^i a) \wedge (2^i r) \Big] \Big\}
- \Big[ \Big( \bigwedge_{j=1}^{k-1} 2^j\, \overline{a} \Big) \wedge (2^k a) \wedge (2^k r) \Big] \bmod 2^K,$$

where $\overline{a}$ stands for the ones complement of $a$, and $\wedge$ stands for the boolean
“AND” operator.

Theorem 1 easily follows from Lemma 1, by considering the particular value
$k = K$ (and taking $a = x'$).

To prove Lemma 1, we will use the following elementary result.

Lemma 2 For any integers $u$ and $v$:

$$u - v = (u \oplus v) - 2(\overline{u} \wedge v) \bmod 2^K.$$

Proof of Lemma 2 (sketch): $u \oplus v$ gives almost the same result as $u - v$, except
that carries have been forgotten. For a given index, a carry appears if and only
if a ‘1’ bit (from $v$) is subtracted from a ‘0’ bit (from $u$), which corresponds to a
‘1’ bit in $\overline{u} \wedge v$. Since the carry is then subtracted in the next index, $\overline{u} \wedge v$ has
to be shifted left, which is the same as being doubled, before being subtracted
from $u \oplus v$.
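Lemma 2 can be confirmed numerically. The sketch below evaluates the right-hand side of the identity on $k$-bit values; the function name is an assumption.

```python
def sub_via_xor(u, v, k):
    """Compute u - v mod 2**k as (u XOR v) - 2 * (NOT u AND v)."""
    mask = (1 << k) - 1
    not_u = ~u & mask
    return ((u ^ v) - 2 * (not_u & v)) & mask
```

One way to see why this holds: `u ^ v` equals `u + v - 2 * (u & v)`, and `(u & v) + (not_u & v)` equals `v`, so the right-hand side reduces to `u + v - 2 * v`.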
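Proof of Lemma 1: We proceed by induction on $k$.

– We first apply Lemma 2 with $u = a \oplus r$ and $v = r$:

$$\Phi_a(r) = (a \oplus r) - r = a - 2(\overline{a \oplus r} \wedge r) \bmod 2^K.$$

Since $\overline{a \oplus r} = \overline{a} \oplus r$, we have:

$$\Phi_a(r) = a - 2((\overline{a} \oplus r) \wedge r) = a - 2(a \wedge r) \bmod 2^K,$$

which proves the case $k = 1$ of Lemma 1 (conventionally, the empty product
$\bigwedge_{j=1}^{0}$ equals the identity element of the $\wedge$ operator).

– Let us suppose that the result of Lemma 1 is true for $k$:

$$\Phi_a(r) = \Big\{ a \oplus \bigoplus_{i=1}^{k-1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j\, \overline{a} \Big) \wedge (2^i a) \wedge (2^i r) \Big] \Big\}
- \Big[ \Big( \bigwedge_{j=1}^{k-1} 2^j\, \overline{a} \Big) \wedge (2^k a) \wedge (2^k r) \Big] \bmod 2^K$$

and let us show that it is also true for $k + 1$.

Let

$$u = a \oplus \bigoplus_{i=1}^{k-1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j\, \overline{a} \Big) \wedge (2^i a) \wedge (2^i r) \Big]$$

and

$$v = \Big( \bigwedge_{j=1}^{k-1} 2^j\, \overline{a} \Big) \wedge (2^k a) \wedge (2^k r).$$

We first obtain:

$$u \oplus v = a \oplus \bigoplus_{i=1}^{k} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j\, \overline{a} \Big) \wedge (2^i a) \wedge (2^i r) \Big].$$

Moreover,

$$\overline{u} = \overline{a \oplus \bigoplus_{i=1}^{k-1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j\, \overline{a} \Big) \wedge (2^i a) \wedge (2^i r) \Big]}
= \overline{a} \oplus \bigoplus_{i=1}^{k-1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j\, \overline{a} \Big) \wedge (2^i a) \wedge (2^i r) \Big],$$

so that:

$$\overline{u} \wedge v = \Big[ \overline{a} \wedge \Big( \bigwedge_{j=1}^{k-1} 2^j\, \overline{a} \Big) \wedge (2^k a) \wedge (2^k r) \Big]
\oplus \bigoplus_{i=1}^{k-1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j\, \overline{a} \Big) \wedge (2^i a) \wedge (2^i r) \wedge \Big( \bigwedge_{j=1}^{k-1} 2^j\, \overline{a} \Big) \wedge (2^k a) \wedge (2^k r) \Big].$$

Therefore

$$\overline{u} \wedge v = \Big( \bigwedge_{j=0}^{k-1} 2^j\, \overline{a} \Big) \wedge (2^k a) \wedge (2^k r)$$

because to each index $i$, $1 \leq i \leq k - 1$, in $\overline{u}$ corresponds an index $j$, $1 \leq j \leq k - 1$,
in $v$ (namely $j = i$), such that:

$$(2^i a) \wedge (2^j\, \overline{a}) = 0.$$

Therefore, applying Lemma 2:

$$\Phi_a(r) = \Big\{ a \oplus \bigoplus_{i=1}^{k} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j\, \overline{a} \Big) \wedge (2^i a) \wedge (2^i r) \Big] \Big\}
- \Big[ \Big( \bigwedge_{j=1}^{k} 2^j\, \overline{a} \Big) \wedge (2^{k+1} a) \wedge (2^{k+1} r) \Big] \bmod 2^K.$$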



Annex 2: Proof of Theorem 2 

We begin by the following elementary result: 

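Lemma 3 For any $z$ and $\delta$, the following identity holds:

$$z + \delta = \overline{\overline{z} - \delta} \bmod 2^K.$$

Proof of Lemma 3: It is easy to see that, for any $\lambda$,

$$\lambda + \overline{\lambda} + 1 = 0 \bmod 2^K.$$

Applying this identity successively with $\lambda = \overline{z} - \delta$ and $\lambda = z$, we obtain:

$$\overline{\overline{z} - \delta} = -(\overline{z} - \delta) - 1 = -((-z - 1) - \delta) - 1 = z + \delta \bmod 2^K.$$

Proof of Theorem 2: We first apply Lemma 3 with $z = A$ and $\delta = r$:

$$A + r = \overline{\overline{A} - r}.$$

Moreover,

$$\overline{A} = A \oplus (-1) = ((A \oplus r) \oplus (-1)) \oplus r = \overline{A \oplus r} \oplus r.$$

Hence

$$A + r = \overline{(\overline{A \oplus r} \oplus r) - r} = \overline{\Phi_{\overline{A \oplus r}}(r)}.$$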
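Both identities used at the start of this proof can be checked numerically. In the sketch below, `complement`, `add_via_complement` and `add_via_phi` are hypothetical helper names, and all values are reduced to `k` bits.

```python
def complement(z, k):
    """Ones complement of z on k bits."""
    return ~z & ((1 << k) - 1)


def add_via_complement(z, delta, k):
    """Lemma 3: z + delta = NOT(NOT z - delta) mod 2**k."""
    mask = (1 << k) - 1
    return complement((complement(z, k) - delta) & mask, k)


def add_via_phi(a, r, k):
    """A + r = NOT Phi_{NOT(A XOR r)}(r), with Phi_{x'}(r) = (x' XOR r) - r."""
    mask = (1 << k) - 1
    x_prime = complement(a ^ r, k)
    phi = ((x_prime ^ r) - r) & mask
    return complement(phi, k)
```

The second function shows how an addition is rewritten as the complement of a $\Phi$ value, which is what allows Theorem 1 to be reused in the rest of the proof.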
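From Theorem 1 (with $\overline{A \oplus r}$ instead of $x'$), we know that:

$$\Phi_{\overline{A \oplus r}}(r) = \overline{A \oplus r} \oplus \bigoplus_{i=1}^{K-1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j (A \oplus r) \Big) \wedge (2^i\, \overline{A \oplus r}) \wedge (2^i r) \Big],$$

so that

$$A + r = A \oplus r \oplus \bigoplus_{i=1}^{K-1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j (A \oplus r) \Big) \wedge (2^i A) \wedge (2^i r) \Big].$$

Let us denote, for any integer $k \geq 0$,

$$u_k = \bigoplus_{i=1}^{k} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j (A \oplus r) \Big) \wedge (2^i A) \wedge (2^i r) \Big].$$

From the definition of $u_k$, we have $u_0 = 0$ and $A + r = A \oplus r \oplus u_{K-1}$. Moreover
for all $k \geq 0$,

$$u_{k+1} = \bigoplus_{i=1}^{k+1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j (A \oplus r) \Big) \wedge (2^i A) \wedge (2^i r) \Big]
= 2(A \wedge r) \oplus \bigoplus_{i=2}^{k+1} \Big[ \Big( \bigwedge_{j=1}^{i-1} 2^j (A \oplus r) \Big) \wedge (2^i A) \wedge (2^i r) \Big],$$

so that, if we denote $i' = i - 1$ and $j' = j - 1$:

$$\begin{aligned}
u_{k+1} &= 2(A \wedge r) \oplus \bigoplus_{i'=1}^{k} \Big[ \Big( \bigwedge_{j'=0}^{i'-1} 2^{j'+1} (A \oplus r) \Big) \wedge (2^{i'+1} A) \wedge (2^{i'+1} r) \Big] \\
&= 2 \Big\{ (A \wedge r) \oplus \bigoplus_{i'=1}^{k} \Big[ \Big( \bigwedge_{j'=0}^{i'-1} 2^{j'} (A \oplus r) \Big) \wedge (2^{i'} A) \wedge (2^{i'} r) \Big] \Big\} \\
&= 2 \Big\{ (A \wedge r) \oplus (A \oplus r) \wedge \bigoplus_{i'=1}^{k} \Big[ \Big( \bigwedge_{j'=1}^{i'-1} 2^{j'} (A \oplus r) \Big) \wedge (2^{i'} A) \wedge (2^{i'} r) \Big] \Big\} \\
&= 2\,[(A \wedge r) \oplus (A \oplus r) \wedge u_k].
\end{aligned}$$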

Annex 3: Proof of Theorem 3 

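Let $\delta$ be any value in $I$. We begin by proving that $\Theta_\delta$ is surjective.

Let $y \in I$. If we denote:

$$\gamma = \bigoplus_{i=0}^{K-1} \Big[ \Big( \bigwedge_{j=1}^{i} 2^{j-1} \delta \Big) \wedge (2^i y) \Big]$$

(conventionally, the empty product $\bigwedge_{j=1}^{0}$ equals the identity element of the $\wedge$
operator), we have:

$$\gamma \oplus (2\gamma) \wedge \delta = \gamma \oplus \bigoplus_{i=0}^{K-1} \Big[ \Big( \bigwedge_{j=0}^{i} 2^{j} \delta \Big) \wedge (2^{i+1} y) \Big],$$

so that, if we denote $i' = i + 1$ and $j' = j + 1$:

$$\gamma \oplus (2\gamma) \wedge \delta = \gamma \oplus \bigoplus_{i'=1}^{K} \Big[ \Big( \bigwedge_{j'=1}^{i'} 2^{j'-1} \delta \Big) \wedge (2^{i'} y) \Big].$$

From the definition of $\gamma$, it is easy to see that:

$$\bigoplus_{i'=1}^{K} \Big[ \Big( \bigwedge_{j'=1}^{i'} 2^{j'-1} \delta \Big) \wedge (2^{i'} y) \Big] = \gamma \oplus y.$$

Therefore:

$$\gamma \oplus (2\gamma) \wedge \delta = y.$$

We have proven that, for any $y \in I$, a value $\gamma \in I$ exists such that $\Theta_\delta(\gamma) = y$.
As a consequence, $\Theta_\delta$ is surjective. Since it maps the finite set $I$ to itself, we deduce that
$\Theta_\delta$ is bijective.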
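The preimage constructed in this proof can be computed explicitly, which gives a concrete inverse of $\Theta_\delta$. The sketch below follows the formula for $\gamma$ term by term; the function names are assumptions and values are reduced to `k` bits.

```python
def theta(delta, gamma, k):
    """Theta_delta(gamma) = gamma XOR ((2 * gamma) AND delta), on k-bit values."""
    mask = (1 << k) - 1
    return gamma ^ (((2 * gamma) & mask) & delta)


def theta_inverse(delta, y, k):
    """Preimage of y: XOR over i of (AND_{j=1..i} 2^(j-1) delta) AND 2^i y."""
    mask = (1 << k) - 1
    gamma = 0
    for i in range(k):
        term = (y << i) & mask
        for j in range(1, i + 1):      # empty when i == 0 (empty product)
            term &= (delta << (j - 1)) & mask
        gamma ^= term
    return gamma
```

Bit by bit, the formula unrolls the relation in which bit $n$ of $\gamma$ equals bit $n$ of $y$ XOR (bit $n$ of $\delta$ AND bit $n-1$ of $\gamma$), which is why the products of shifted copies of $\delta$ appear.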
Abstract. Although tamper-resistant devices are specifically designed
to thwart invasive attacks, they remain vulnerable to micro-probing.
Among several possibilities to provide data obfuscation, keyed hardware
permutations can provide compact design and easy diversification.
We discuss the efficiency of such primitives, and we give several examples
of implementations, along with proofs of effectively large key-space.

Keywords. Tamper-resistance, Probing attacks, Data scrambling, Keyed
permutations, Smart-cards.



1 Introduction 

Microprobing techniques are invasive attacks consisting in introducing a con- 
ductor point into certain parts of a tamper-resistant chip to monitor the elec- 
trical signal at this spot [3,1], in order to extract some secret information. A 
natural means to thwart these attacks consists in encrypting the data stored 
or exchanged inside the chip. Using classical block-ciphers like DES provides a
natural solution, but this method quickly becomes impractical when the data
concerned pass through highly time-critical processes, for example the communication
between the microprocessor and the RAM. In this case, lighter
techniques must be used to provide very fast processing at the expense of a
lower, but acceptable security level. This category of techniques is usually and
informally called scrambling, or obfuscation, as opposed to encryption [4].

A popular primitive for scrambling in highly constrained environments con- 
sists simply in bit permutations, these permutations being parameterized by a 
key. As it appears in what follows, such functions result in very compact designs, 
where only one cycle is needed to process the data. Furthermore, a large number 
of permutations can be generated, with a one-to-one correspondence with the 
key space. Ultimately, keyed permutations can be easily used in more complex 
functions which require some keyed linear components. 



Ç.K. Koç, D. Naccache, and C. Paar (Eds.): CHES 2001, LNCS 2162, pp. 16-27, 2001.
© Springer-Verlag Berlin Heidelberg 2001
More precisely, this paper addresses the problem of designing keyed permutations
of compact shape, that generate a large set of permutations when the key
runs over the key space, and that offer good properties against chosen plaintext 
attacks in the context of physical probing. This combinatorial issue is tractable 
for a small number of bits, but becomes more intricate for realistic values like 
16 or 32, which brings intrinsic interest to the results of Section 3. The rest of 
this paper is organized as follows. Section 2 defines a security model for scram- 
bling functions, and proposes a criterion for the design of keyed permutations. 
Section 3 is the main part of our paper. Three different constructions for keyed 
permutations are proposed, along with proofs of some of their properties. Hard- 
ware engineers interested in quickly evaluating the practical contribution of this 
paper can directly jump to Section 4, which contains some numerical data about 
our new keyed permutations. Some possible applications are also listed. An ex- 
ample of a very fast on-chip data scrambler which integrates keyed permutations 
is proposed. 

2 Scrambling Functions and Probing Attacks 

2.1 Security Model 

We consider the context of a smart-card microprocessor, which communicates
with the RAM. The memory, and the channel which links it to the microprocessor,
are subject to probing attempts. Consequently, to prevent information
disclosure, a data word $(b_0, \ldots, b_{n-1})$ is encrypted with a key $K$ using a scrambling
function $C_K$ before being sent to the memory. The key $K$ may be refreshed
each time the card is reset, but might also be regenerated more often, using
multiple-key encryption techniques. We assume that the attacker is allowed to play
with the microprocessor, which implies that he can send any data he wants
to the memory. His goal is to decipher a secret data item present in the card, read
from the RAM at some time. The difference with a classical chosen plaintext
attack on a block-cipher is that the attacker has only a partial knowledge of
the ciphertext. Indeed, probing attacks are usually not easy to mount, and in
particular, the attacker can rarely probe wherever he wants [3]. Consequently,
we restrict the capabilities of the attacker to recovering only some of the bits

$$(b'_0, \ldots, b'_{n-1}) = C_K(b_0, \ldots, b_{n-1}).$$

2.2 A Security Criterion for Linear Functions 

For efficiency reasons, it is practical to choose for $C_K$ a linear function. This
choice does not provide any security against full chosen plaintext attacks, but
might be sufficient if we assume that the attacker knows very few bits $b'_i$. One of
the possible strategies of the attacker to decrypt a secret data item might be to recover
$K$ completely during a preliminary phase when several plaintext messages are
sent to the scrambling function. In this context, we can quantify the security
provided by $C_K$ by determining the number of wires that the attacker has to
be able to probe simultaneously to recover the key. In particular, when $C_K$
is a permutation $\sigma_K$ of the group $S_n$ of the permutations of $\{0, \ldots, n - 1\}$,
this question boils down to: what is the minimal number of pairs $(i, \sigma_K(i))$ the
attacker needs to know to recover $K$ entirely? To formalize this condition, we
introduce some definitions and notations. If $\mu$ and $\sigma$ are two elements of $S_n$, we
denote by $\mu\sigma$ the permutation defined by $i \mapsto \mu(\sigma(i))$. We also denote by $\iota$ the
permutation such that $\iota(i) = i$ for all $i \in \{0, \ldots, n - 1\}$.

An $(n, k)$-keyed permutation is a map from the set $\{0, 1\}^k$ to $S_n$:

$$\sigma : \{0, 1\}^k \to S_n, \quad K \mapsto \sigma_K.$$

The degree of freedom of an $(n, k)$-keyed permutation is the smallest integer
$m \geq 1$ such that there exists an $(m + 1)$-tuple $(i_1, \ldots, i_{m+1})$ of pairwise distinct
elements of $\{0, \ldots, n - 1\}$, such that the map

$$\{\sigma_K \mid K \in \{0, 1\}^k\} \to \{0, \ldots, n - 1\}^{m+1}, \quad \sigma_K \mapsto (\sigma_K(i_1), \ldots, \sigma_K(i_{m+1}))$$

is injective. Informally, the degree of freedom is equal to the minimum number
of pairs $(i, \sigma_K(i))$ we have to fix to determine $\sigma_K$ uniquely. Note that this does
not mean that this suffices to determine $K$, as the map from $\{0, 1\}^k$ to $S_n$ might
not be injective, but in our context, the secret key is completely recovered as
soon as $\sigma_K$ is known. From a practical standpoint, this definition implies also
that we should look for keyed permutations with a degree of freedom as high as
possible.

For example, in the strongest case, if $\sigma$ is surjective onto $S_n$, then $\sigma$ has degree
of freedom $n-1$: we need exactly $n-1$ distinct pairs $(i, \sigma_K(i))$ to determine
$\sigma_K$ completely (the missing value is inferred from the $n-1$ others, since $\sigma_K$ is
a bijection). For the weakest case, let $\mu \neq \iota$ be in $S_n$, and consider the keyed
permutation $\sigma$ from $\{0,1\}$ to $S_n$ such that $\sigma_0 = \iota$ and $\sigma_1 = \mu$. Then $\sigma$ has degree
of freedom one: as $\mu \neq \iota$, there exists $i_1$ such that $\sigma_b(i_1) \neq i_1$ iff $b = 1$.
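
The two extreme cases above can be checked by brute force on small examples. The sketch below is only an illustration: it follows the informal reading of the definition (the minimum number of pairs needed to single out one permutation of the family), and the function name `degree_of_freedom` and the tuple encoding of permutations are assumptions made here, not notation from the paper.

```python
from itertools import combinations, permutations


def degree_of_freedom(perms):
    """Smallest r such that some r distinct points i_1..i_r make the map
    sigma -> (sigma(i_1), ..., sigma(i_r)) injective on the given family."""
    distinct = set(map(tuple, perms))  # the key-to-permutation map need not be injective
    n = len(next(iter(distinct)))
    for r in range(1, n + 1):
        for points in combinations(range(n), r):
            images = {tuple(p[i] for i in points) for p in distinct}
            if len(images) == len(distinct):
                return r
    return n


# Strongest case: every permutation of {0, 1, 2, 3} is reachable.
full = list(permutations(range(4)))
# Weakest case: sigma_0 is the identity, sigma_1 is the transposition (0, 1).
weak = [(0, 1, 2, 3), (1, 0, 2, 3)]

print(degree_of_freedom(full))  # 3, that is n - 1
print(degree_of_freedom(weak))  # 1
```

The search is exponential in n, so it is only meant for toy sizes such as n = 4 or n = 8.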

3 A Recursive Construction 

3.1 Outline of the Result 

This section explains the construction of three different $(n,k)$-keyed permutations
when $n$ is a power of two. These three constructions can be realized using
combinational logic, and the corresponding circuits are of depth $\log_2 n$.
Consequently, they are very compact and have very short propagation delays.

The construction of Section 3.3 generates $2^{n-1}$ permutations, which are in
one-to-one correspondence with the key space. This construction is improved in
Section 3.4, where we generate $n^{n/2}$ permutations. Section 3.5 further improves this
result by generating at least $n^{\alpha n} 2^{-\beta n}$ permutations, with $\alpha = (\log_2 6)/4 \approx 0.65$
and $\beta = (\log_2 6)/4 - 1/2 \approx 0.15$. Furthermore, we prove that the last two
constructions have degree of freedom at least $n/2 - 1$.

Fig. 1. Hardware realization of a switch



3.2 Hardware Representation of Keyed Permutations 

The most natural approach to designing hardware permutations is to use the set of
the transpositions of $S_n$. We recall that a transposition is an element $(i,j)$ of $S_n$
which exchanges the symbol $i$ with the symbol $j$. A well-known fact is that every
permutation in $S_n$ can be expressed as a product of transpositions. If $\tau$ is a
transposition, the keyed permutation $b \mapsto \tau^b$ with one bit of key can be realized
using two parallel multiplexers. We call such a block a switch. A hardware realization of
a switch is given in Figure 1. Oriented graphs provide a compact representation
of switch-based circuits. For example, Figure 2 represents the keyed permutation
$(b_0, b_1) \mapsto (1,3)^{b_0}(0,1)^{b_1} \in S_4$. The grey nodes correspond to the switches, and
are controlled by additional key wires, which do not appear in the figure. In
the following, the depth of a circuit will refer to the number of stages composing
the circuit, this number being related to a switch-based design. Note that the
switch-depth is less than or equal to the multiplexer-depth.




3.3 A Group Theoretic Construction 

We denote by $\mathcal{H}_2^n$ a largest subgroup of $S_n$ whose order is a power of two. $\mathcal{H}_2^n$
is called a Sylow 2-subgroup of $S_n$. In all the following, we will suppose that $n$
is a power of two. In this case, $\mathcal{H}_2^n$ has order $2^{n-1}$ [5]. A set of generators of $\mathcal{H}_2^n$
can be constructed recursively as follows. Consider the set $\{0, \ldots, n-1\}$, and
the permutation $g \in \mathcal{H}_2^n$ which exchanges $\{0, \ldots, n/2-1\}$ with $\{n/2, \ldots, n-1\}$
by $i \mapsto i + n/2$. Now, we can repeat this procedure inductively by considering
the sets $\{0, \ldots, n/2-1\}$ and $\{n/2, \ldots, n-1\}$. We finally get $n-1$ elements of
$S_n$ which generate $\mathcal{H}_2^n$.

These generators are very easy to implement in hardware, since permuting
two sets of $k$ bits can be done using $k$ switches parameterized by the same bit
of key. This yields an $(n, \log_2 n - 1)$-keyed permutation that can be realized
using $(n \log_2 n)/2$ switches. Figure 3 schematically summarizes this recursive
construction.
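
The recursive generation can be sketched in a few lines. This is an illustrative model, not the authors' circuit: it assumes one independent key bit per generator (so $n-1$ key bits in total, matching the $2^{n-1}$ permutations quoted above), and the names `sylow_perm` and `compose` are introduced here for the example.

```python
from itertools import product


def sylow_perm(n, key):
    """Permutation selected by n - 1 key bits, as a tuple p with p[i] = image of i."""
    if n == 1:
        return (0,)
    half = n // 2
    swap_bit, rest = key[0], key[1:]
    left = sylow_perm(half, rest[:half - 1])
    right = sylow_perm(half, rest[half - 1:])
    # Act inside each half first, then optionally exchange the halves (i -> i + n/2).
    inner = left + tuple(x + half for x in right)
    if swap_bit:
        return tuple((x + half) % n for x in inner)
    return inner


def compose(p, q):
    """(p q)(i) = p(q(i)), the composition convention of Section 2.2."""
    return tuple(p[q[i]] for i in range(len(p)))


n = 8
group = {sylow_perm(n, key) for key in product((0, 1), repeat=n - 1)}
print(len(group) == 2 ** (n - 1))                                 # one permutation per key
print(all(compose(p, q) in group for p in group for q in group))  # closed under composition
```

Both checks are expected to print `True` for n = 8: the 128 keys give 128 distinct permutations, and the set is closed under composition, as it must be for a group.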




Fig. 3. Recursive construction based on Sylow 2-subgroups



An interesting property of this design is that the set of generated permutations
forms a group. We can take advantage of this fact to increase the number
of generated permutations. In our previous construction, we built a hardware
design which realized the keyed permutation $g_K$, where $g_K$ takes all the values
of $\mathcal{H}_2^n$ when $K$ runs over the key space. Denote by $\rho$ a well-chosen permutation,
which we implement in hardware, that is, by physically permuting the wires.
Then, by reusing the previous construction, we can realize the keyed permutation

$\delta_{(K_1,K_2)} = g_{K_1} \circ \rho \circ g_{K_2}$,

which should generate more permutations. The question is to determine how
many permutations are effectively generated by this method. It is easy to see
that no collisions appear (i.e. the number of generated permutations is equal to
$|\mathcal{H}_2^n|^2$) iff the following algebraic condition is verified:

$\rho \mathcal{H}_2^n \rho^{-1} \cap \mathcal{H}_2^n = \{\iota\}$. (1)

The naive complexity of checking if a given permutation $\rho$ verifies (1) is equal to
$|\mathcal{H}_2^n| = 2^{n-1}$. Consequently, our approach fails as soon as, say, $n \geq 32$, since this
last verification has to be made 32!/2 times on average before finding a solution.
Nevertheless, for $n = 32$, we may still get a result using the following trick:
we define $H$, the subgroup of $\mathcal{H}_2^{32}$ which preserves $\{0, \ldots, 15\}$ and $\{16, \ldots, 31\}$.
$H$ is isomorphic to $\mathcal{H}_2^{16} \times \mathcal{H}_2^{16}$ and has cardinality $2^{30}$. Consider the keyed
permutation $(K_1, K_2) \mapsto h_{K_1} \beta g_{K_2}$, where $h_{K_1}$ runs over $H$, $g_{K_2}$ runs over $\mathcal{H}_2^{32}$,
and where $\beta$ is a fixed permutation. This map is injective iff

$\beta \mathcal{H}_2^{32} \beta^{-1} \cap H = \{\iota\}$. (2)

A simple method to find such a /3 is first to solve (1) for n = 16, and then to set 
[3 = (p(-),p(- — 16) + 16). This search terminates on average after {\S\q/'H^\ ■ 
\T~0f\Y^'^ « 2^^ trials. For instance, the following permutation is a solution of 
(1) for n = 16: 



p = (0, 15, 9, 10, 11, 12, 13, 14)(1, 2, 3)(4, 5, 6, 7, 8) . 

The resulting number of generated permutations is equal to $2^{30} \cdot 2^{31} = 2^{61}$.



3.4 Generalization of the Group-Based Design 

Unfortunately, this improvement works only for $n \leq 32$. Furthermore, we generate
only $2^{61}$ permutations, among the $32! \approx 2^{118}$ elements of $S_{32}$. Nevertheless,
as we will see, a slight modification of the set of generators leads to generating
a much larger subset of $S_n$. The price to pay for this improvement is the loss of
the group property, but this has no impact on our application. As before, the
solution is built recursively by induction on $\log_2 n$, so that at each step of the
induction, we add a new stage to the corresponding circuit.

Theorem 1. If $n$ is a power of two, then there exists a circuit of depth $\log_2 n$
involving $(n \log_2 n)/2$ switches, which realizes an $(n, (n \log_2 n)/2)$-keyed permutation
$\delta$. Furthermore, the number of distinct generated permutations is equal to
the number of keys, that is $n^{n/2}$.

Proof. We proceed by induction. Let $\sigma$ be an $(n/2, ((n/2) \log_2(n/2))/2)$-keyed
permutation with the properties stated in the theorem. For convenience, we set
$k = ((n/2) \log_2(n/2))/2$. First, we define the $(n, 2k)$-keyed permutation $\mu$ as

$\mu_{(K_1,K_2)}(i) = \sigma_{K_1}(i)$ if $0 \leq i < n/2$,
$\mu_{(K_1,K_2)}(i) = \sigma_{K_2}(i - n/2) + n/2$ if $n/2 \leq i < n$, (3)

where $K_1, K_2 \in \{0,1\}^k$. Let $k' = 2k + n/2$, and define the $(n, k')$-keyed
permutation $\delta$ by

$\delta_{(K_1,K_2,E)} = \varepsilon_E \circ \mu_{(K_1,K_2)}$,

where

$\varepsilon_E = \prod_{i=0}^{n/2-1} (i, i + n/2)^{e_i}$

and $E = (e_0, \ldots, e_{n/2-1}) \in \{0,1\}^{n/2}$. First, $k' = 2k + n/2 = n(\log_2 n - 1)/2 +
n/2 = (n \log_2 n)/2$, which corresponds to what we expect. Furthermore, the
number of switches used for realizing $\delta$ is equal to $2k + n/2 = k'$. It remains
to prove that $\delta$ is injective. This comes from the fact that for all $0 \leq i < n/2$,
$\delta_{(K_1,K_2,E)}^{-1}(i) < n/2$ iff $e_i = 0$. Consequently, we can uniquely recover $E$ from
$\delta_{(K_1,K_2,E)}$. As $\mu$ is injective, we can also uniquely recover $K_1$ and $K_2$ from
$\mu_{(K_1,K_2)} = \varepsilon_E^{-1} \circ \delta_{(K_1,K_2,E)}$, which concludes the proof. □

Figure 4 represents an example of this construction for the case n = 8. 
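
The recursion of Theorem 1 is short enough to be checked exhaustively for small $n$. The following sketch is an illustration under stated assumptions: the key is laid out as the concatenation of $K_1$, $K_2$ and $E$, permutations are stored as tuples, and the helper names `key_size` and `delta` are introduced here rather than taken from the paper.

```python
from itertools import product


def key_size(n):
    """k(n) = (n log2 n) / 2, computed via k(n) = 2 k(n/2) + n/2 and k(1) = 0."""
    return 0 if n == 1 else 2 * key_size(n // 2) + n // 2


def delta(n, key):
    """delta_(K1,K2,E) = eps_E o mu_(K1,K2), as a tuple p with p[i] = image of i."""
    if n == 1:
        return (0,)
    half, k = n // 2, key_size(n // 2)
    k1, k2, e = key[:k], key[k:2 * k], key[2 * k:]
    s1, s2 = delta(half, k1), delta(half, k2)
    mu = s1 + tuple(x + half for x in s2)  # equation (3)

    def eps(x):
        # eps_E exchanges j and j + n/2 whenever e_j = 1.
        return (x + half) % n if e[x % half] else x

    return tuple(eps(x) for x in mu)


for n in (2, 4, 8):
    perms = {delta(n, key) for key in product((0, 1), repeat=key_size(n))}
    print(n, key_size(n), len(perms), n ** (n // 2))
```

For each $n$ the last two numbers on a line should coincide, which is the injectivity claim of the theorem: 2, 16 and 4096 distinct permutations for $n = 2, 4, 8$.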




As motivated in Section 2.2, we want to check that $\delta$ has a high enough degree
of freedom. This is guaranteed by the following result:

Theorem 2. The degree of freedom $d_\delta$ of the keyed permutation $\delta$ of Theorem 1
verifies

$d_\delta \geq n/2 - 1$.

Proof. We proceed by induction on $n$. When $n = 2$, the theorem is true, as
in order to guess the state of the switch, we have to know at least one pair
$(i, \delta_K(i))$. Suppose that the theorem is true at step $n/2$. Recall that with the
notations of Theorem 1, $\delta$ is given by the recursion formula

$\delta_{K=(K_1,K_2,E)} = \varepsilon_E \circ \mu_{(K_1,K_2)}$,

where $\mu$ is defined from the $(n/2, k)$-keyed permutation $\sigma$ following (3). The
induction hypothesis implies that $\sigma$ has degree of freedom at least $n/4 - 1$. We set
$r = n/2 - 1$, and we choose an $r$-tuple $(i_1, \ldots, i_r)$ of pairwise distinct elements of
$\{0, \ldots, n-1\}$, a key $K = (K_1, K_2, E)$, and we set $j_l = \delta_K(i_l)$. Consider the sets
$I_1 = \{l / \varepsilon_E^{-1}(j_l) < n/2\}$ and $I_2 = \{l / \varepsilon_E^{-1}(j_l) \geq n/2\}$. As $|I_1| + |I_2| = n/2 - 1$, one
of the two sets (for example $I_1$) has strictly less than $n/4$ elements: $|I_1| \leq n/4 - 1$.
Furthermore, as $\mu$ preserves $\{0, \ldots, n/2-1\}$ and $\{n/2, \ldots, n-1\}$, for all $l \in I_1$,
$i_l < n/2$. Consequently, there exists $K_1' \neq K_1$ such that

$\forall l \in I_1, \sigma_{K_1}(i_l) = \sigma_{K_1'}(i_l)$.

This implies that

$\delta_{(K_1,K_2,E)}(i_1, \ldots, i_r) = \delta_{(K_1',K_2,E)}(i_1, \ldots, i_r)$.

This proves that $\delta$ has degree of freedom greater than $n/2 - 1$. □

3.5 Further Improvements

We may still try to improve the previous construction by modifying it so that
it generates a larger set of permutations. We again consider the keyed
permutation $\delta$ as defined above, with the recursion formula

$\delta_{K=(K_1,K_2,E)} = \varepsilon_E \circ \mu_{(K_1,K_2)}$.

Define the vector $e = (e_0, \ldots, e_{n-1})$ as $e_i = 1\{\delta_K^{-1}(i) < n/2\}$, where $1\{P\}$ is
equal to one when the predicate $P$ is true, and to zero otherwise. As underlined
in the proof of Theorem 1, there is a one-to-one correspondence between the
set of all the keys $E$ and the set of all the $n/2$-tuples $(e_0, \ldots, e_{n/2-1})$. This
is because $\mu$ preserves the segments $i < n/2$ and $i \geq n/2$. This means that
$e$ contains twice too much information. Consequently, our idea is to group the
transpositions $(j, j + n/2)$ of $\varepsilon$ two by two, and to compose them with a cycle
of the four concerned elements, so that we could still invert our map thanks to
the associated 4-tuple of bits $e_i$.

Consider first the set $\{0, 1, 2, 3\}$, and the map

$g : (b_0, b_1, b_2) \in \{0,1\}^3 \mapsto (0,2,1,3)^{b_0} (0,2)^{b_1} (1,3)^{b_2}$.

We consider also the map $h$ defined by

$h(b_0, b_1, b_2) = (1\{g(b_0, b_1, b_2)^{-1}(i) < 2\})_{0 \leq i \leq 3}$.

The truth table of $h$ is given below:

    $(b_0, b_1, b_2)$    $h(b_0, b_1, b_2)$
    (0,0,0)              (1,1,0,0)
    (0,0,1)              (1,0,0,1)
    (0,1,0)              (0,1,1,0)
    (0,1,1)              (0,0,1,1)
    (1,0,0)              (0,0,1,1)
    (1,0,1)              (1,0,1,0)
    (1,1,0)              (0,1,0,1)
    (1,1,1)              (1,1,0,0)

As can be seen, $h(\{0,1\}^3)$ has cardinality six.
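
The truth table can be recomputed mechanically from the definitions of $g$ and $h$. The sketch below assumes the composition convention of Section 2.2 (the rightmost factor is applied first) and stores each permutation as a tuple of images; the constant and function names are chosen here for illustration.

```python
from itertools import product

IDENT = (0, 1, 2, 3)
CYCLE = (2, 3, 1, 0)  # the 4-cycle (0,2,1,3): 0->2, 2->1, 1->3, 3->0
T02 = (2, 1, 0, 3)    # the transposition (0,2)
T13 = (0, 3, 2, 1)    # the transposition (1,3)


def compose(p, q):
    """(p q)(i) = p(q(i))."""
    return tuple(p[q[i]] for i in range(4))


def g(b0, b1, b2):
    """(0,2,1,3)^b0 (0,2)^b1 (1,3)^b2 as a tuple of images."""
    perm = IDENT
    for bit, factor in ((b0, CYCLE), (b1, T02), (b2, T13)):
        if bit:
            perm = compose(perm, factor)
    return perm


def h(b0, b1, b2):
    """Indicator vector of the positions whose preimage under g lies in {0, 1}."""
    perm = g(b0, b1, b2)
    inverse = [0] * 4
    for i, image in enumerate(perm):
        inverse[image] = i
    return tuple(int(inverse[i] < 2) for i in range(4))


table = {bits: h(*bits) for bits in product((0, 1), repeat=3)}
for bits, value in table.items():
    print(bits, value)
print(len(set(table.values())))  # 6 distinct outputs
```

The eight printed rows agree with the table above, and the two collisions are $(0,1,1)$ with $(1,0,0)$ and $(0,0,0)$ with $(1,1,1)$.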

Now, group the points of $\{0, \ldots, n-1\}$ four by four (when $n \geq 4$):

$A_0 = (0, 1, n/2, n/2 + 1)$,
$A_1 = (2, 3, n/2 + 2, n/2 + 3)$,
$\vdots$
$A_{n/4-1} = (n/2 - 2, n/2 - 1, n - 2, n - 1)$.

Finally, form the cycles $c_0, \ldots, c_{n/4-1}$, with support respectively equal to the
$A_i$, obtained from the cycle $(0, 2, 1, 3)$ by applying for each $0 \leq i < n/4$ the
substitutions

$0 \mapsto i$, $1 \mapsto i + 1$, $2 \mapsto n/2 + i$, $3 \mapsto n/2 + i + 1$.

We are now ready to build recursively a keyed permutation $\chi$, with the same
method as in the proof of Theorem 1. Following analogous notations, we define
$\chi_{(K_1,K_2,E,F)}$ inductively as

$\chi_{(K_1,K_2,E,F)} = \kappa_F \circ \varepsilon_E \circ \mu_{(K_1,K_2)}$, (4)

where

$\kappa_F = \prod_{i=0}^{n/4-1} c_i^{f_i}$,

with $F = (f_0, \ldots, f_{n/4-1}) \in \{0,1\}^{n/4}$.

Theorem 3. $\chi$ is an $(n,k)$-keyed permutation, where k = |nlog 2 n. $\chi$ can be
realized using $k$ switches, and has degree of freedom at least $n/2 - 1$. Furthermore,
$\chi$ generates at least $a_n$ distinct permutations, where $a_n$ verifies the recursion
formula

$a_n = 6^{n/4} a_{n/2}^2$ if $n \geq 4$,
$a_2 = 2$.

Proof. We prove the recursion formula, the verification of the other points being
straightforward. Suppose that we have constructed an $(n/2, k)$-keyed permutation
$\sigma$ that verifies our statements. We denote by $\mathcal{E}$ the largest set of keys
such that $\sigma$ restricted to $\mathcal{E}$ is injective. Referring to the recursive construction of
$\chi$ of equation (4), it is clear that $\mu$ restricted to $\mathcal{E} \times \mathcal{E}$ is injective: this is a direct
consequence of definition (3) of $\mu$. Consider for each $0 \leq i < n/4$ the 3-tuples
$U_i = (e_{2i}, e_{2i+1}, f_i)$, and the set

$A = \{(0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,1), (1,1,0)\}$.

Using the truth table of $h$, we see that $h$ restricted to $A$ is injective. Consider
the set $\mathcal{F}$ of the keys defined by $\mathcal{F} = \{(E, F) / \forall i, U_i \in A\}$. It is clear that

$|\mathcal{F}| = |A|^{n/4} = 6^{n/4}$. (5)

Now, since $\mu_{(K_1,K_2)}$ preserves the sets $\{0, \ldots, n/2-1\}$ and $\{n/2, \ldots, n-1\}$, we
have that

$\chi_K^{-1}(i) < n/2 \iff (\kappa_F \circ \varepsilon_E)^{-1}(i) < n/2$.

This implies that $\chi$ restricted to $\mathcal{E} \times \mathcal{E} \times \mathcal{F}$ is injective. Using equality (5) and
the fact that $|\mathcal{E}| \geq a_{n/2}$, this proves the theorem. □

It is easy to check by induction that $\log_2 a_n \geq (n \log_2 n)/2$, which means that
$\chi$ effectively generates more permutations than $\delta$. The explicit expression of $a_n$
announced in Section 3.1 results from the fact that the sequence $(\log_2 a_n)/n$ is
in arithmetic progression.
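
The arithmetic progression can be made concrete numerically. Dividing the recursion by $n$ gives $(\log_2 a_n)/n = (\log_2 6)/4 + (\log_2 a_{n/2})/(n/2)$ with starting value $1/2$ at $n = 2$, hence $\log_2 a_n = n(\alpha \log_2 n - \beta)$ with $\alpha$ and $\beta$ as in Section 3.1. The sketch below compares the recursion with this closed form; the function name `a` and the printed layout are choices made for this example.

```python
from math import isclose, log2


def a(n):
    """Lower bound a_n of Theorem 3: a_2 = 2 and a_n = 6^(n/4) * a_(n/2)^2."""
    return 2 if n == 2 else 6 ** (n // 4) * a(n // 2) ** 2


ALPHA = log2(6) / 4
BETA = ALPHA - 0.5

for n in (8, 16, 32, 64):
    recursive = log2(a(n))
    closed_form = n * (ALPHA * log2(n) - BETA)
    # Last column: log2 of the number of permutations generated by delta.
    print(n, round(recursive, 2), round(closed_form, 2), n * log2(n) / 2)
```

Rounded to the nearest integer, the recursive values are 14, 39, 99 and 239, to be compared with 12, 32, 80 and 192 for $\delta$.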

Contrary to $\delta$, the distribution of the permutations generated by $\chi_K$, when
$K$ is chosen uniformly, is not uniform. We leave open the question of determining
this distribution exactly. In any case, it is easy to reduce the key space so that the
restriction of $\chi$ becomes injective. For that, it suffices to restrict the keys $(E, F)$
to the set $\mathcal{F}$ defined in the proof above, and to proceed by induction.

The practical realization of $\chi$ requires designing in hardware a keyed permutation
with a cycle of length four, like $(0,1,2,3)^b$, $b \in \{0,1\}$. This can easily be
done in one stage using four multiplexers.

Figure 5 shows a realization of $\chi$ for the case $n = 8$. The nodes with 8 edges
represent the cycle $(0, 2, 1, 3)$ involved in the construction of $\chi$.




4 Practical Examples and Applications 

4.1 Numerical Examples 

Table 1 shows the characteristics of the two keyed permutations $\delta$ and $\chi$ for
$n = 8, 16, 32, 64$. The number of multiplexers needed for their
construction is denoted by $N_{mux}$, and the number of distinct generated
permutations is denoted by $N_{perm}$.



Table 1. Characteristics of $\delta$ and $\chi$ for various values of n

    Permutation          $\delta$  $\chi$    $\delta$  $\chi$    $\delta$  $\chi$    $\delta$  $\chi$
    n                    8         8         16        16        32        32        64        64
    $N_{mux}$            32        40        88        112       224       288       544       704
    Depth                3         6         4         8         5         10        6         12
    Key size             12        16        32        44        80        112       192       272
    $[\log_2 N_{perm}]$  12        14        32        39        80        99        192       239
    $[\log_2 n!]$        15        15        44        44        118       118       296       296



An important fact is the very small number of stages needed to implement
$\delta$ and $\chi$. For example, for $n = 32$, the design has only five levels of gates.
This property makes these functions particularly suitable for data scrambling in
critical paths. A non-exhaustive list of applications: scrambling of the bus between
the microprocessor and the memory, scrambling of the RAM, or scrambling of
the bus between the CPU and the cryptoprocessor.

4.2 Protecting the Secrecy of the Design

These functions can also easily be diversified, and thus provide a customizable
design, so that the final scrambling function can remain secret. Recall that $\delta$ is
built recursively from the equation

$\delta_{(K_1,K_2,E)} = \varepsilon_E \circ \mu_{(K_1,K_2)}$.

This definition corresponds to the “normal form” of our construction.
Derived forms can be obtained as follows: at each step of the induction, we
choose two permutations $\alpha_1$, $\alpha_2$, acting respectively on $\{0, \ldots, n/2-1\}$ and on
$\{n/2, \ldots, n-1\}$, and we implement these permutations in hardware, that is,
we physically permute the wires of the circuit. Here, $\alpha_1$ and $\alpha_2$ are supposed to
be kept secret. With the same material, we can now build $\delta$ using the modified
equation

$\delta_{(K_1,K_2,E)} = \varepsilon_E \circ \alpha_1 \circ \alpha_2 \circ \mu_{(K_1,K_2)}$.

It is not difficult to see that we generate, mutatis mutandis, the same number of
permutations as before, and that the resulting keyed permutation has the same
degree of freedom. It suffices for this to rewrite the proofs of Theorems 1 and 2.
The same construction can be applied to $\chi$, with the same consequences.

4.3 Non-linear Data Scrambling Using Keyed Permutations 

The primitives that we have just described can easily be incorporated into more
complex non-linear data scrambling functions. One major advantage of the
proposed constructions is the large size of the key space and of the resulting
function space. Besides, the very compact shape of the resulting circuits allows
them to be used several times in more complex functions.

Following Shannon’s basic confusion-diffusion paradigm, these keyed permutations
can be used in alternating layers with small, say 4-bit to 4-bit, substitution
boxes (S-boxes). Clearly, such constructions cannot achieve the same security
level as classical block ciphers do: following Shamir’s security analysis [6], a
five-layer SASAS construction using alternating layers of S-boxes and affine functions
(of which permutations are a special case) can be broken using approximately
$2^{16}$ chosen plaintexts for 128-bit blocks and 8-bit to 8-bit S-boxes.

However, this kind of construction still yields a sub-exponential security 
bound instead of a linear security bound in terms of chosen plaintext attacks. As 
the attacker has quite limited resources in the probing setting anyhow, bearing 
in mind that she is not able to probe more than a handful of wires simultaneously 
using the same session scrambling key, a limited number of layers of additional 
key-dependent S-boxes will sufficiently increase the difficulty of unscrambling 
the memory and bus contents in the context of tamper-resistant objects such as 
smart-cards. 

In terms of circuit complexity, a 4-bit to 4-bit S-box can be efficiently implemented
using an average of 32 gates with a circuit depth of three (in a completely
optimized architecture this depth may become as low as one). Thus for the
example SASAS structure for a 32-bit input size, each substitution layer adds
approximately 256 gates to the 224 gates of the proposed keyed permutation.
With five layers altogether, the circuit has around 1200 gates for a depth of 19.
It is then left to the designer to select whatever circuit complexity is acceptable
in the concerned architecture compared to the obfuscation level fit for purpose.
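
The gate and depth figures can be reproduced with a few lines of arithmetic. This sketch assumes that the five SASAS layers consist of three S-box layers and two keyed-permutation layers, and it takes the permutation depth of five for $n = 32$ from Table 1; the variable names are illustrative only.

```python
SBOX_GATES, SBOX_DEPTH = 32, 3   # average figures for a 4-bit to 4-bit S-box
PERM_GATES, PERM_DEPTH = 224, 5  # keyed permutation for n = 32
WIDTH = 32                       # input size in bits

s_layer_gates = (WIDTH // 4) * SBOX_GATES         # 8 S-boxes per substitution layer
total_gates = 3 * s_layer_gates + 2 * PERM_GATES  # layers S, A, S, A, S
total_depth = 3 * SBOX_DEPTH + 2 * PERM_DEPTH

print(s_layer_gates, total_gates, total_depth)  # 256 1216 19
```

The total of 1216 gates is consistent with the "around 1200 gates" quoted above, and the depth of 19 matches exactly.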

5 Conclusion 

We proposed three implementations of keyed permutations, which achieve very
short depth and an effectively large key space. We also indicated a criterion to
identify keyed permutations with good properties against chosen plaintext
attacks realized by probing. These functions are particularly well suited for data
obfuscation in very constrained environments like smart cards.
Abstract. Techniques such as DPA and SPA can be used to find the
secret keys stored in smart-cards. These techniques have caused concern 
for they can allow people to recharge their stored value smartcards (in 
effect printing money), or illegally use phone or digital TV services. We 
propose an addition to current processors which will counteract these 
techniques. By randomising register usage, we can hide the secret key 
stored in a smartcard. The extension we propose can be added to existing 
processors, and is transparent to the algorithm. 



1 Background 

Modern cryptography is about ensuring the integrity, confidentiality and authen- 
ticity of digital communications. As such it has a large number of applications 
from e-commerce on the Internet through to charging mechanisms for pay-per- 
view-TV. As more and more devices become network aware they also become 
potential weak links in the chain. Hence cryptographic techniques are now being
embedded into devices such as smart cards, mobile phones and PDAs. This
poses a number of problems since the cryptographic modules are no longer main- 
tained in secure vaults inside large corporations. For a cryptographic system to 
remain secure it is imperative that the secret keys used to perform the required 
security services are not revealed in any way. 

The fact that secret keys are now embedded into a number of devices means 
that the hardware becomes an attractive target for hackers. For example if one 
could determine the keys which encrypt the digital television transmissions, then 
one could create decoders and sell them on the black market. On a more serious
front if one could determine the keys which protect a number of stored value 
smart cards, which hold an electronic representation of cash, then one could 
essentially print money. 

Since cryptographic algorithms themselves have been studied for a long time 
by a large number of experts, hackers are more likely to try to attack the hard- 
ware and system within which the cryptographic unit is housed. A particularly 
worrying attack has been developed in the last few years by P. Kocher and col- 
leagues at Cryptography Research Inc., see [6] and [7]. In these attacks a number 
of physical measurements of the cryptographic unit are made, which include
power consumption, computing time or EMF radiation. These measurements
are made over a large number of encryption or signature operations and then,
using statistical techniques, the secret key embedded inside the cryptographic
unit is uncovered.

These attacks work because there is a correlation between the physical mea- 
surements taken at different points during the computation and the internal 
state of the processing device, which is itself related to the secret key. For ex- 
ample, when data is loaded from memory, the memory bus will have to carry 
the value of the data, which will take a certain amount of power depending on 
the data value. Since the load instruction always happens at the same time one 
can produce correlations between various runs of the application, eventually
giving away the secret of the smart card. The three main techniques developed by
Kocher et al. are timing attacks, simple power analysis (SPA) and differential
power analysis (DPA). It is DPA which provides the most powerful method of
attack, which can be mounted using very cheap resources.

Following Kocher’s papers a number of people have started to examine this 
problem and propose solutions, see [1], [2], [3] and [8]. Goubin and Patarin [3] 
give three possible general strategies to combat DPA type attacks: 

1. Introduce random timing shifts so as to decorrelate the output traces on 
individual runs. 

2. Replace critical assembler instructions with ones whose signature is hard to 
analyse, or reengineer the crucial circuitry which performs the arithmetic 
operations or memory transfers. 

3. Make algorithmic changes to the cryptographic primitives under considera- 
tion. 

In [9] May, Muller and Smart propose a method for introducing highly ag- 
gressive randomised execution into a conventional processor. They argue that 
this produces a great deal of temporal misalignment of traces, which can help 
defeat DPA. The methodology is to take standard techniques from the design 
of super-scalar architectures and replace parallel execution with random
execution. They call this new processor architecture NDISC for Non-Deterministic
Instruction Stream Computer.

This defence essentially combines all three of the above defences in that it 
adds considerable timing shifts to the instructions, it introduces circuitry which 
is hard to analyse and essentially makes algorithmic changes to the program “on 
the fly”.

In this paper we expand on this philosophy by proposing a technique which 
allows the non-deterministic altering of the register to register or memory to 
register transfers. As such this produces a defence more akin to the second of 
the proposed defences above. This new defence can be implemented using a 
very small number of changes to the underlying processor and is completely 
transparent to the algorithm. 

In super-scalar architectures, see [4], [5] or [10], it is standard practice to im- 
plement a form of register renaming. This allows the processor to schedule more 
instructions in parallel. If such a system were implemented in the NDISC
processor, then the instruction stream could be executed in an even more randomised
order. However, if the register renaming were performed in a randomised, rather
than the standard deterministic, manner, then one would obtain the extra effect
that the power consumed by each register write operation would be different from
one run to the next.

The concept of randomised register renaming can be implemented in a con- 
ventional processor to obtain some defence against DPA, but when implemented 
in an NDISC processor we expect the overall defence against DPA to be greatly 
enhanced. Hence, we first present an introduction to what an NDISC processor 
is. 

2 NDISC Processors 

In order to prevent attacks based on correlating data, we have designed a simple 
addition to standard processors that randomises instruction issuing [9]. Cru- 
cially, an attack works because two runs of the same program give comparable 
results; everything compares bar the data. By changing the data even slightly 
the attacker will get a knowingly different trace, and by correlating the traces, 
one builds a picture of what is happening inside the processor. 

An NDISC processor removes correlation between runs, thereby making the 
attack much harder. A conventional processor executes a sequence of instructions 
deterministically; it may execute instructions out-of-order, but it will always 
execute instructions out-of-order in the same way. If the same program is run 
twice in a smart card, then the same instruction trace will be executed. By
allowing the processor to choose a random instruction ordering at run time, we
get multiple possible traces that are executed.

2.1 Random Issuing 

In single pipeline processors a sequence of instructions is executed in the order in 
which they are fetched by the processor. There is a little out-of-order execution 
to help with branch prediction but this all occurs on a very small scale. On 
multiple pipeline processors there are a number of execution units through which 
independent instructions can be passed in parallel. For example, if a processor 
has a logic pipeline and an integer-arithmetic pipeline, then the following two 
instructions 

    ADD R0, R1, R1

    XOR R4, R5, R5

may be executed in parallel in the two pipelines. One pipeline will execute the 
ADD, the other will execute the XOR. 

Our idea is the following: like a superscalar we identify instructions that can 
be issued independently, but instead of using this information to issue instruc- 
tions in parallel, we use this information to execute instructions out-of-order, 
where the processor makes a random choice as to issue order. We call this pro- 
cess Instruction Descheduling. This creates a level of non-determinism in the 
internal workings of the processor. This is illustrated in Figure 1.

    Single Pipeline Processor:   ADD R0,R1,R1 then XOR R4,R5,R5
    Two Pipeline Processor:      ADD R0,R1,R1 and XOR R4,R5,R5 in parallel
    Non-Deterministic Computing: (ADD R0,R1,R1 or XOR R4,R5,R5) then
                                 (XOR R4,R5,R5 or ADD R0,R1,R1)

Fig. 1. Simple comparison of how a non-deterministic processor executes two
instructions as opposed to other processors



The reduction in the effectiveness of DPA results from the fact that the power 
trace from one run will be almost completely uncorrelated with the power trace 
from a second run, since on the two runs different execution sequences are used 
to produce the same result. 

Instruction descheduling means that at run time the processor will select, at 
random, an instruction to execute, thereby randomising the instruction stream, 
and randomising the access pattern to memory caused by both data and instruc- 
tion streams. 

A full description of how an NDISC machine can be implemented is given
in [9]; we outline the NDISC architecture here. The set of instructions waiting
to be executed is held in a block called an issue window. The random instruc- 
tion selection unit randomly selects instructions from the issue window that are 
executable. An instruction is considered executable if it does not depend on 
any result that is not yet available, and the instruction does not overwrite any 
data that is still to be used by other instructions that are not yet executed, or 
instructions that are in execution in the pipeline. 

The implementation of this closely follows the implementation of multi-issue 
processors. There is a block of logic that determines conflicts between instruc- 
tions, resulting in a set of instructions that is executable. From this set we select 
an instruction at random. Given a random number generator, which will nor- 
mally be constructed from a pseudo random number generator that is reseeded 
regularly with some entropy, we select one of the executable instructions and 
schedule it for execution. 
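
The selection rule can be illustrated with a small software model. This is a simplified sketch, not the hardware of [9]: it issues one instruction at a time with no pipeline, assumes the operand order (source, source, destination), uses 8-bit arithmetic, and lets a seeded `random.Random` stand in for the hardware random number generator. The names `PROGRAM`, `executable` and `run` are introduced here for the example.

```python
import random

# Each instruction is (opcode, src1, src2, dest); this operand order is an assumption.
PROGRAM = [
    ("ADD", "R0", "R1", "R1"),
    ("XOR", "R4", "R5", "R5"),
    ("ADD", "R1", "R5", "R2"),
]
OPS = {"ADD": lambda a, b: (a + b) & 0xFF, "XOR": lambda a, b: a ^ b}


def executable(window, index):
    """True if no earlier pending instruction conflicts with window[index]."""
    _, s1, s2, dest = window[index]
    for _, e1, e2, edest in window[:index]:
        if edest in (s1, s2):        # depends on a result that is not yet available
            return False
        if dest in (e1, e2, edest):  # would overwrite data that is still to be used
            return False
    return True


def run(program, registers, rng):
    """Execute the program, picking a random executable instruction at each step."""
    regs, window, trace = dict(registers), list(program), []
    while window:
        ready = [i for i in range(len(window)) if executable(window, i)]
        op, s1, s2, dest = window.pop(rng.choice(ready))
        regs[dest] = OPS[op](regs[s1], regs[s2])
        trace.append(op + " " + ",".join((s1, s2, dest)))
    return regs, trace


start = {"R0": 1, "R1": 2, "R2": 0, "R4": 3, "R5": 4}
for seed in range(4):
    final, trace = run(PROGRAM, start, random.Random(seed))
    print(trace, final["R2"])
```

Whatever order is chosen, the final register contents are the same as for in-order execution; only the trace differs from run to run. The oldest pending instruction is always executable, so the loop cannot stall.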

3 Random Register Renaming 

3.1 Basic Register Renaming 

Register renaming is a common technique used to improve the performance of 
processors. Renaming works by defining a set of virtual register identifiers (which 
are used in the instruction set of the processors) and a set of physical registers 
(which are used in the execution unit). At any moment in time, each virtual 
register identifier is associated uniquely with a physical register identifier. 

This binding is unique at any time, but the strength of renaming is that it 
changes with time. Any time that a virtual register is overwritten, the binding 
between the virtual and physical register can be severed, and a fresh physical 
register can be assigned to the virtual register. The reason that this increases
performance in a standard processor is that this physical register can be used
immediately for storing new values, whereas the old physical register can still
hold values that are used by instructions which are still in execution. This allows
out-of-order parallel execution.

As an example, consider the following bit of code which implements the two
assignments A=B; C=D;:

    LOAD  B, R0
    STORE R0, A
    LOAD  D, R0 ; R0 is overwritten
    STORE R0, C

Upon execution of the code the third instruction will overwrite the value of
register R0 with the value of D. This instruction can only be executed if the
previous store instruction has been completed. If the registers are renamed so
that we use R1 instead of R0, we would get the following code:

    LOAD  B, R0
    STORE R0, A
    LOAD  D, R1 ; R1 is overwritten
    STORE R1, C

In this code segment the third line of code can be executed in parallel with the
first two, speeding up the processor. Static register allocation (by the compiler)
does not achieve the same effect as register renaming for two reasons. First, with
register renaming one needs fewer bits in the instruction set to encode registers.
Second, register renaming works at run time; register assignments are based on
which instructions are in progress, and which are waiting for execution.



3.2 Random Register Renaming 

We employ Random Register Renaming in order to weaken a DPA attack. Our
observation is that a large fraction of the power trace is produced by overwriting
register values. Each time the value of a register is overwritten, the power
consumption is related to the number of bits that are flipped. We are going to
rename registers non-deterministically; that way, any time that a register is
overwritten, it overwrites a value that is not predetermined, randomising the
power trace.

As an example, consider the following code: 

LOAD B,R0 
STORE R0,A 
LOAD D,R0 
STORE R0,C 

where we suppose that we have just one virtual register identifier (R0) and
two physical registers (Reg0, Reg1). When this piece of code is executed
there are four different ways to execute it:

LOAD B,Reg0     LOAD B,Reg0     LOAD B,Reg1     LOAD B,Reg1
STORE Reg0,A    STORE Reg0,A    STORE Reg1,A    STORE Reg1,A
LOAD D,Reg0     LOAD D,Reg1     LOAD D,Reg0     LOAD D,Reg1
STORE Reg0,C    STORE Reg1,C    STORE Reg0,C    STORE Reg1,C

Fig. 2. Random register renaming for simple processor

Each of these execution traces has its own individual power-trace. Indeed, for 
longer instruction sequences the number of possible traces grows exponentially, 
and combined with NDISC execution [9], we can attain high levels of protection. 
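The four traces above can be enumerated mechanically. The following is a minimal sketch in Python, not part of the proposed hardware; the function name `renamed_traces` and the tuple encoding of instructions are assumptions made for illustration. Following the example, every LOAD may pick any physical register, including the one currently bound to R0.

```python
from itertools import product

def renamed_traces(program, physical):
    """Yield every trace obtained by choosing a physical register at each LOAD.

    `program` is a list of (opcode, memory_operand) pairs that all use the
    single virtual register R0; this encoding is an illustrative assumption.
    """
    n_writes = sum(1 for op, _ in program if op == "LOAD")
    for choice in product(physical, repeat=n_writes):
        picks = iter(choice)
        current = None            # physical register currently bound to R0
        trace = []
        for op, operand in program:
            if op == "LOAD":      # destination write: rebind R0
                current = next(picks)
                trace.append(f"LOAD {operand},{current}")
            else:                 # STORE reads the current binding
                trace.append(f"STORE {current},{operand}")
        yield trace

program = [("LOAD", "B"), ("STORE", "A"), ("LOAD", "D"), ("STORE", "C")]
traces = list(renamed_traces(program, ["Reg0", "Reg1"]))
for t in traces:
    print(" ; ".join(t))
```

With w register writes and p candidate registers the sketch yields p^w traces, which is the exponential growth referred to above.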

4 Implementation 

4.1 Basic Implementation 

In order to implement random register renaming we distinguish between virtual
register identifiers (these are the registers as specified by the instruction set),
and physical registers (which are the ones used by the processor). We assume
that the number of physical registers is greater than or equal to the number of
virtual register identifiers. In particular we assume that there are $2^V$ virtual
register identifiers and $2^P$ physical registers, where P is larger than V.

An instruction is fetched into the processor and preprocessed to rename its
registers. The registers are renamed using a Virtual to Physical mapping table.
This process is illustrated in Figure 2. The register mapper maintains a mapping
from virtual identifiers to physical registers; this can be seen as an array of
register values. A series of bits called used records whether each physical
register is at present in use.

Each instruction has a number of source and destination operands. In most
instructions these operands are virtual register identifiers. On an instruction
prefetch, the virtual register identifiers of the source operands are mapped onto
physical registers using the virtual-to-physical mapper. This is done by using
the virtual register number as an index in the mapping table, and thus locating
a physical register number.
Fig. 3. Selecting a random register from a set of 2 (R-box$^0$)



The physical register for the destination operand is selected from the set
of available free physical registers using a random value calculated at run time,
as shown in the next section. The virtual register identifier of the destination
operand was previously mapped onto another physical register. This physical
register is now marked as free. The physical register that has been selected as the
destination operand is marked as used and the mapping table is updated, associating
the randomly selected physical register with the virtual identifier of the destination
operand. Note that in a pipelined processor the register will only be marked free
later; this is common practice in register renaming hardware. When source and
destination registers have been mapped, the remapped instruction can be passed
to the issue mechanism.
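The bookkeeping described above can be sketched in software as follows. This is only an illustration under stated assumptions: the class name `Renamer`, the identity mapping used at start-up and the use of Python's `random` module in place of a hardware random source are all hypothetical, and the old register is freed immediately rather than later as a pipelined design would do.

```python
import random

class Renamer:
    """Toy model of the virtual-to-physical mapping table and the used bits."""

    def __init__(self, v_bits, p_bits, rng=None):
        assert p_bits > v_bits, "need more physical than virtual registers"
        self.rng = rng or random.Random()
        n_virtual, n_physical = 1 << v_bits, 1 << p_bits
        # Assumed initial state: virtual register i lives in physical register i.
        self.mapping = list(range(n_virtual))
        self.used = [p < n_virtual for p in range(n_physical)]

    def rename(self, sources, dest):
        """Return (physical source registers, physical destination register)."""
        # Sources are looked up before the destination binding changes.
        phys_sources = [self.mapping[v] for v in sources]
        free = [p for p, u in enumerate(self.used) if not u]
        new = self.rng.choice(free)      # random pick among free registers
        old = self.mapping[dest]
        self.used[old] = False           # a pipeline would defer this step
        self.used[new] = True
        self.mapping[dest] = new
        return phys_sources, new

renamer = Renamer(v_bits=1, p_bits=2, rng=random.Random(2001))
print(renamer.rename(sources=[0, 1], dest=0))
```

Because the number of used registers always equals the number of virtual identifiers, the free set in this model is never empty.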



4.2 Selecting a Random Register from a Set 

It is essential that in a single clock cycle we extract a free register. Generating
a random number in a clock cycle is not difficult, but we must only pick from
the set of free registers. For this purpose, we have devised a random selection
unit. The random selection unit is a tree structure. Its simplest form consists
of a single "R-box$^0$" selecting one random element from a set of 2, shown in
Figure 3.

This box has two inputs $I_0$ and $I_1$ which denote whether register 0 and
register 1 are currently in use, a random input bit $R_0$, and an output bit $A_0$
pointing to the randomly selected free register. If $I_0$ and $I_1$ are both zero (i.e.,
both registers are free), then the random bit determines whether the output is
zero or one. If one of the registers is not free, then the output is fixed to point
to the other register.

We now construct a tree of AND-gates, AND-ing the used-bits of each register
as shown in Figure 4. The output of each of the AND gates is '0' if there is at
least one free register in the set of registers covered by that AND gate, and '1'
if there are no free registers in that set. So, the output of the left-hand top-level
AND gate is '0' if there is at least one free register in the lower half of
the registers. The output of the right-hand top-level AND gate is '0' if there is
at least one free register in the higher half of the registers. We use R-box$^0$ to
determine which of those two halves of the register set we are going to pick a
register from.

Fig. 4. Tree to reduce free set

Fig. 5. General R-box

At the next level down in the tree we have the outputs of four AND gates, which
state whether there are free registers in each of the four quarters of the
register bank. R-box$^0$ has determined whether the free register is picked from the
top or the bottom half, and R-box$^1$ is going to decide which of the two quarters
to use.

We generalise R-box$^1$ and R-box$^2$ from R-box$^0$ to have $2^{k+1}$ inputs
$I_0, \ldots, I_{2^{k+1}-1}$, k input-address bits $A_0, \ldots, A_{k-1}$ and one output
bit $A_k$, as shown in Figure 5. This R-box selects a value for $A_k$ so that bit
$I_{A_0,\ldots,A_k}$ is zero. The selection is based on the assumption that all the higher
address bits ($A_0 \ldots A_{k-1}$, the inputs provided by previous R-boxes) have been
selected before in such a way that there is an empty register available. The
selection process works by selecting two bits from the set $I_0, \ldots, I_{2^{k+1}-1}$
using $A_0 \ldots A_{k-1}$, and subsequently uses the random bit to decide which of those
two bits to select. This gives us one more address bit, which results in the picking
of a single input $I_{A_0,\ldots,A_k}$.

This architecture works since we can guarantee to have at least one physical
register free at any point in the program. This solution selects a random bit
which is set to zero in constant time, at the cost of the random selection being
skewed. In particular, if all but one of the bits in one half of the set are one,
then this solution will favour the single free register in that half, selecting it
with a 50% probability. This can be improved by performing a random rotation on
the input bits prior to selecting a random bit.

Fig. 6. Register renaming in conjunction with (non deterministic) out of order
execution
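The selection tree of Section 4.2 and its skew can be modelled in a few lines. The sketch below is a software analogue only: the names `r_box` and `select_free` are assumptions, the AND tree is recomputed with `all()` at each level instead of being wired in parallel, and the random bits come from Python's `random` module.

```python
import random

def r_box(i0, i1, r):
    """Output bit of one R-box: i0/i1 say whether each half is fully in use."""
    if i0 and not i1:
        return 1          # lower half full, point at the upper half
    if i1 and not i0:
        return 0          # upper half full, point at the lower half
    return r              # both halves have a free register: use the random bit

def select_free(used, rng):
    """Walk the tree from the root and return the index of a free register."""
    assert not all(used), "at least one register must be free"
    lo, hi = 0, len(used)             # len(used) is assumed to be a power of 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        i0 = all(used[lo:mid])        # AND of the used bits in the lower half
        i1 = all(used[mid:hi])        # AND of the used bits in the upper half
        if r_box(i0, i1, rng.getrandbits(1)):
            lo = mid
        else:
            hi = mid
    return lo

# Skew: register 3 is the only free register in the lower half.
used = [1, 1, 1, 0, 0, 0, 0, 0]
rng = random.Random(0)
picks = [select_free(used, rng) for _ in range(4000)]
print({reg: picks.count(reg) for reg in sorted(set(picks))})
```

In this configuration register 3 is reached whenever the top-level random bit points at the lower half, whereas each of the four free registers in the upper half needs two further random bits to match.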



4.3 Random Register Renaming in a Non Deterministic Processor 

Random register renaming can be applied to any processor. For a standard
microprocessor we have shown that random register renaming requires very little
alteration to the standard renaming unit found in modern super-scalars. When
attaching this unit to an NDISC processor [9], one needs to be more careful. For
NDISC processors we determine the set of free registers as shown in Figure 6.
A matrix of size $I \times 2^P$ (where I is the issue window size) maintains for each
instruction in the issue window which registers that instruction uses. A one in
element (i,p) in the matrix indicates that register p will be used by instruction i.
A zero in element (i,p) indicates that register p will not be used by instruction i.
When an instruction is stored in location E of the issue buffer, the bits in row E
associated with its source and destination registers are set to one. An extra row
at the bottom of the matrix keeps track of which registers are currently mapped
and could, therefore, be used by future instructions.

When an instruction C has been dispatched and completed, the bits for
row C are reset to zero (an optimisation would be to reset the source registers
when the source registers have been read by the execution unit, and to reset
the destination register when the result value has been written). The logical
OR of all bits in a column r determines whether a register r is in use, a one
value indicating that the register is used. These values are used by the renaming
unit in order to determine which register to use in order to remap a destination
register, as in the previous example of a standard processor.
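The matrix bookkeeping can be illustrated with a small model. This is a sketch under assumptions: the class `UsageMatrix` and its method names are hypothetical, and the column-wise OR is computed with a loop rather than in parallel hardware.

```python
class UsageMatrix:
    """I x 2^P usage matrix plus the extra row of currently mapped registers."""

    def __init__(self, window, n_physical):
        self.rows = [[0] * n_physical for _ in range(window)]
        self.mapped = [0] * n_physical          # extra bottom row

    def issue(self, slot, registers):
        """Mark the source and destination registers of the instruction in `slot`."""
        for p in registers:
            self.rows[slot][p] = 1

    def complete(self, slot):
        """Reset the whole row once the instruction has completed."""
        self.rows[slot] = [0] * len(self.mapped)

    def in_use(self):
        """Column-wise OR over all rows, including the mapped row."""
        return [int(self.mapped[p] or any(row[p] for row in self.rows))
                for p in range(len(self.mapped))]

m = UsageMatrix(window=2, n_physical=4)
m.mapped[0] = 1              # register 0 currently holds a virtual register
m.issue(0, [1, 2])           # instruction in slot 0 reads 1 and writes 2
print(m.in_use())
m.complete(0)
print(m.in_use())
```

The vector returned by `in_use` plays the role of the used bits that feed the random selection unit.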

5 Conclusion and Future Work 

We have shown how one can use ideas from super-scalar architectures to produce
a randomised non-deterministic processor. This idea essentially allows a single
instruction stream to correspond to one of an exponential number of possibly
executed programs, all of which produce the same output.

In the case of random register renaming used on its own the actual executed 
program is in the same program sequence but with the source and destination 
registers randomly altered. In the case of using both random register renaming 
and the ideas in [9] we obtain not only a program whose registers have been 
randomly reassigned, but the program sequence is now randomised as well. Since 
attacks such as differential power analysis rely on the correlation of data across 
many execution runs of the same program, the idea of random register renaming 
can help to defeat such attacks. 

Further research still needs to be carried out, yet we feel that the proposed
solution to differential power analysis gives a number of advantages, such as
the fact that program code does not need to be modified and neither speed
nor power consumption is compromised. In the near future we aim to produce
a demonstration version of the NDISC processor with the random register
renaming included in the main execution unit. This demonstration will then be
tested for resistance to DPA on a number of cryptographic algorithms.
Abstract. Power Analysis attacks on elliptic curve cryptosystems, and
various countermeasures against them, were first discussed by Coron
([6]). All proposed countermeasures are based on the randomization or
blinding of the input parameters of the binary algorithm. We propose a
countermeasure that randomizes the binary algorithm itself. Our algorithm
needs approximately 9% more additions than the ordinary binary
algorithm, but makes power analysis attacks really difficult.

Keywords: Power Analysis, Elliptic Curve Cryptosystems 



1 Introduction 

Elliptic curve cryptosystems (ECC) have attracted much attention since they
were first proposed in 1985 by Miller [24] and Koblitz [15]. The underlying
discrete logarithm problem seems to be much harder than in other groups. Today,
no subexponential-time algorithm is known for this problem in the case of
non-supersingular curves. This results in much shorter key lengths for ECC, which
makes those cryptosystems especially attractive for hardware implementations,
for instance on smartcards. One must therefore consider not only mathematical
attacks on ECC, but also attacks that exploit weaknesses in the implementation.

In recent years, attacks have been published that use leaked side-channel
information such as the power consumption or timing measurements. These
methods are all passive, which means that an attacker just needs to monitor the
cryptographic device. The most popular method today, differential power analysis
(DPA), was introduced in 1998 by Cryptography Research. DPA exploits the
information drawn from the leakage of power consumption. At first mostly applied
to symmetric cryptosystems, DPA was then applied successfully to public key
cryptosystems, see [6], [12] and [20]. Power analysis is a very strong attack. For
a successful attack on a straightforward DES implementation only a few hundred
measurements are needed. Also the technical effort is comparatively small.
One just needs a digital sampling oscilloscope with an appropriate sampling
rate for the power measurements, and a standard PC to process the obtained
measurement data. The processing of the data itself is also very easy and it is
not necessary to understand the concept of the attack to perform it successfully.



Ç.K. Koç, D. Naccache, and C. Paar (Eds.): CHES 2001, LNCS 2162, pp. 39-50, 2001.
© Springer-Verlag Berlin Heidelberg 2001




E. Oswald and M. Aigner 



To make a long story short, this attack can be mounted not only by experienced
cryptanalysts, but by everyone! This fact makes it even more necessary to
counteract this attack for both private and public key cryptosystems.

In this paper we deal exclusively with elliptic curve cryptosystems. The
countermeasures that were proposed in the aforementioned articles rely on
randomizing or blinding the parameters (the elliptic curve point P and the secret
key k) of the binary algorithm. In our article we present a countermeasure based
on randomizing the binary algorithm itself. Our method not only provides
security against power attacks, but also does not slow down the encryption
algorithm, or require the storage of additional elliptic curve parameters (for
instance the number of points on the curve) as other methods do.

The paper is organized as follows. Considering the binary algorithm, we re- 
view the concept of power analysis in section 2. In section 3 we explain the 
method of addition-subtraction chains as a speedup for the standard binary al- 
gorithm. Finally we present our randomization method on the grounds of these 
addition-subtraction chains. 

2 Elliptic Curve Cryptosystems, the Binary Algorithm 
and Power Attacks 

Some public key cryptosystems require the computation of a modular
exponentiation ($P = M^k \bmod p$) or a scalar multiplication ($P = kM$). This is usually
done by the binary algorithm binalg(P, M, k) which is sketched (in its bottom-up
version) in the following figure:

binalg(P,M,k)
    Q = M
    if k_0 = 1 then P = M else P = 0
    for i = 1 to n - 1
        Q = Q * Q
        if (k_i == 1) then
            P = P * Q
    return P



For validity and explanation see [14]. The * denotes hereby an appropriate 
operation, which can be the multiplication for instance, but also the addition. 
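To make the role of the generic operation * concrete, the following is a minimal Python sketch of the bottom-up binary algorithm. The signature (an explicit `op` and `identity` argument) is an assumption made for illustration; the loop runs over the bits that are actually set in k rather than over a fixed length n.

```python
def binalg(M, k, op, identity):
    """Bottom-up (right-to-left) binary algorithm for a generic operation `op`."""
    Q = M
    P = M if k & 1 else identity     # bit k_0
    k >>= 1
    while k:
        Q = op(Q, Q)                 # squaring, or doubling in additive notation
        if k & 1:
            P = op(P, Q)             # multiplication, or addition
        k >>= 1
    return P

# Modular exponentiation: * is multiplication modulo p.
p = 101
print(binalg(3, 13, lambda a, b: a * b % p, 1) == pow(3, 13, p))
# Scalar multiplication in the additive group of integers: * is addition.
print(binalg(5, 13, lambda a, b: a + b, 0) == 13 * 5)
```

The same routine serves both cases because only the group operation and its identity element change.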

2.1 ECC Basics 

Elliptic curve cryptosystems make use of the binary algorithm for the
computation of the scalar point multiplication. An elliptic curve over a field K, short
E(K), is defined as a nonsingular homogeneous cubic polynomial
$F(x_0, x_1, x_2) \in K[x_0, x_1, x_2]$, provided there is at least one rational point on
E(K) [13]. The set of rational points can be made into an abelian group in a natural
way. If $P_1, P_2 \in E(K)$, then the line connecting both points intersects the elliptic
curve in a third point $P_3$. Further, one calls the third point of intersection of the
line connecting O (for example) and $P_3$ with E(K) the sum of $P_1$ and $P_2$. For
char $K \neq 2, 3$ every elliptic curve can be written as

$x_0 x_2^2 = x_1^3 - A x_0^2 x_1 - B x_0^3, \quad A, B \in K.$ (1)

This curve has only one point at infinity, which is the identity element of the
group. With the transformation $x = x_1/x_0$ and $y = x_2/x_0$, $x_0 \neq 0$, one gets the
equation for an elliptic curve in affine coordinates:

$y^2 = x^3 - Ax - B.$ (2)

The point at infinity lies infinitely far off in the direction of the y axis. Thus
the inverse of a point $P = (x, y) \neq O$ is $-P = (x, -y)$. Formulas for the point
addition and point duplication on an elliptic curve defined over a finite field can
be found for example in [5]. The following tables give a brief overview of the
different costs of the two operations. I denotes the inversion, M the multiplication
and S the squaring in K. Conversion from projective to affine coordinates
is not taken into account. More efficient projective representations are also not
included in the tables; for these see again [5].



Characteristic K = 2

Operation         Coordinates
                  affine         projective
Point addition    1I+2M+1S       15M+5S
Point doubling    1I+2M+1S       5M+5S

Characteristic K > 3

Operation         Coordinates
                  affine         projective
Point addition    1I+3M          16M
Point doubling    1I+4M          10M



Remark 1. In a finite field of characteristic 2, the inverse of an elliptic curve
point P = (x,y) is given as -P = (x, x + y). Since an addition of two
field elements is calculated by a bit-by-bit XOR, we again get the inverse of an
elliptic curve point for free.

2.2 Power Analysis 

Power analysis attacks use the fact that the instantaneous power consumption
of a hardware device is related to the instructions being executed and
the data being manipulated at that instant. An unskilled implementation of an
elliptic curve point duplication and an elliptic curve point addition can therefore
easily be used to mount a simple power attack (or simple power analysis, short SPA).
An adversary just needs to monitor the device's power consumption and identify the
parts of the power trace that correspond to the additions and duplications. This
trivially gives the secret key. It is clear that in order to be SPA resistant, one must
try to prevent data-dependent branches, as sketched in algorithm binalg'(P,M,k).
Note that the computational effort is much higher than in the standard binary
algorithm.
binalg'(P,M,k)
    P = 1, Q = M
    for i = 0 to n - 1
        P[0] = P
        P[1] = P * Q
        Q = Q * Q
        P = P[k_i]
    return P
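As an illustration of binalg', the sketch below performs the same two group operations in every iteration and only selects which result to keep. The function name and signature are assumptions; Python itself gives no constant-time guarantees, so the sketch shows the fixed operation sequence only, not a hardened implementation.

```python
def binalg_prime(M, k, n, op, identity):
    """Binary algorithm without key-dependent branches on the group operations."""
    P, Q = identity, M
    for i in range(n):
        candidates = (P, op(P, Q))        # P[0] = P, P[1] = P * Q
        Q = op(Q, Q)
        P = candidates[(k >> i) & 1]      # P = P[k_i]
    return P

calls = 0
def counted_mul(a, b):
    """Modular multiplication that counts how often it is invoked."""
    global calls
    calls += 1
    return a * b % 101

result = binalg_prime(3, 0b1011, 8, counted_mul, 1)
print(result == pow(3, 0b1011, 101), calls)
```

For an n-bit loop the operation count is always 2n, independent of k, which is the higher computational effort noted above.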



Differential power analysis uses more sophisticated, statistical techniques to
attack the secret key. One power analysis variant is to partition the
measurements into two (or more) different sets by some oracle (for instance the guess
of a secret key bit) and then check whether these two sets are statistically different.
This will only be the case if the oracle was correct, and thus reveals some parts of
the key. Since statistical difference is usually computed by the distance-of-mean test,
which basically compares the means of two distributions, we will refer to this
method as the mean method in subsequent sections. The second method
computes the covariance between the measurements and the oracle. Again, only a
correct oracle can correlate with the measurements (we will refer to this as the
correlation method). We give some examples to clarify this description.

Example 1 (Single-Exponent, Multiple-Data Attack). The SEMD attack [20]
compares the power signal of an encryption operation using a known parameter
(public key) to a power signal using an unknown parameter (secret key). The
attacker can learn where the two signals differ and thus learn the unknown
(secret) parameter. Due to noise components, direct comparisons of power signals
are unreliable, thus DPA techniques are applied. One computes n random values
with the secret and the known parameter. The average signals are calculated and
subtracted as in the mean method. The portions of the DPA signal that depend
on the (random) data will be wiped out by the averaging and subtraction. The
portion of the DPA signal that is dependent on the parameter will average out
to two different values depending on the performed operation. The portions of
the DPA signal that are ≈ 0 are either data dependent or correspond to steps where
the operations in the binary algorithm agree. The other portions indicate that the
operations in the binary algorithm differ.

This attack can also be seen as an extension of an SPA attack, and can therefore
be prevented by the modification sketched in binalg'(P,M,k). Note that this
variant does not make many assumptions about the cryptographic device. More
sophisticated versions that make more assumptions can be found in [20].
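The mean method used in Example 1 can be illustrated on synthetic data. Everything in the sketch below is an assumption made for illustration: each "trace" is a single sample equal to a hidden bit plus Gaussian noise, and the function name `difference_of_means` is hypothetical. No measured data is involved.

```python
import random
from statistics import mean

def difference_of_means(samples, selector):
    """Split the samples by the selector bits and subtract the two means."""
    ones = [s for s, bit in zip(samples, selector) if bit]
    zeros = [s for s, bit in zip(samples, selector) if not bit]
    return mean(ones) - mean(zeros)

rng = random.Random(2001)
n = 2000
hidden = [rng.getrandbits(1) for _ in range(n)]          # value that leaks
samples = [bit + rng.gauss(0.0, 1.0) for bit in hidden]  # toy leakage model

correct_oracle = hidden                                  # right partition
wrong_oracle = [rng.getrandbits(1) for _ in range(n)]    # unrelated partition

print("correct oracle:", round(difference_of_means(samples, correct_oracle), 3))
print("wrong oracle:  ", round(difference_of_means(samples, wrong_oracle), 3))
```

With the correct oracle the two sets differ clearly in their means, while an unrelated partition yields a difference close to zero, which is the behaviour the attack relies on.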

Example 2 (Correlation Attack). When using algorithm binalg' the mean method
will not be successful because there is no difference in the sequence of
instructions. But if one knows the representation of the computed points one can again
mount a successful attack (which has been shown in [6]). At step i, the
processed point depends only on the first bits $k_0 \ldots k_{i-1}$ of the secret parameter
k. When P[i] is processed, power consumption is correlated to the bits of P[i].
No correlation will be observed if the point is not computed. The value of the
least significant bit of k can be learned by calculating the correlation between
the power consumption and any specific bit of 2M. As one can see in algorithm
binalg', the only key-dependent operation in the for-loop is whether the value
of P[1] or P[0] is copied to P. If $k_0 = 0$, the value of P[0] (which is 1 in this
case) remains in P and therefore a correlation between 2M and the power
consumption in the subsequent pass of the for-loop must be observed. Otherwise, if
$k_0 = 1$, the value of P[1] (which is M in this case) will be copied to P and no
correlation between 2M and the power consumption of the computation of P[1]
in the subsequent pass of the for-loop will be observed. The other bits can be
recursively recovered in the same way.

Coron also shows in [6] how to extend the correlation method to any scalar
multiplication algorithm executed in constant time with a constant
addition-subtraction chain.



2.3 Countermeasures 

Basically all proposed countermeasures suggest blinding or randomizing the se- 
cret parameters. When computing P = kM one has the possibility to 

- randomize (blind) k: One needs to know the number of points $\#E(K)$
on the elliptic curve. Then one chooses a random number r and calculates
$k' = k + r \cdot \#E(K)$. Obviously $P = kM = k'M$, because $\#E(K) \cdot M = O$.
For this approach one has to store an additional parameter of the elliptic
curve on the cryptographic device, which is often not desirable. The second
disadvantage is that, depending on the bit length of $r \cdot \#E(K)$, the effective
key length may increase.

- blind M: A point is blinded by adding a secret random point R for which
one knows S = kR. Scalar multiplication is done by calculating k(R + M)
and subtracting S to get P = kM. The points R and S can be stored inside
the cryptographic device and updated for each new execution as follows:
$R = (-1)^b 2R$ and $S = (-1)^b 2S$, where b is a random bit. Note that two
additional points must be stored inside the device, which is also often
not desirable.

- randomize M: Projective coordinates can be used to avoid the inversions
as well as for randomization. Because of the fact that

$(X, Y, Z) = (\lambda X, \lambda Y, \lambda Z), \quad \forall \lambda \neq 0$

one can choose another random $\lambda$ for each new execution. As one can see,
this variant relies on the usage of projective coordinates instead of affine
coordinates.



All these countermeasures require storing additional parameters or performing
additional operations.
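The first countermeasure, blinding k, can be checked on a toy curve. The sketch below assumes the textbook-sized curve $y^2 = x^3 + 2x + 2$ over $F_{17}$ with 19 points and base point (5, 1); this curve, the function names and the range of r are illustrative assumptions and far too small for real use. It requires Python 3.8 or later for `pow(x, -1, p)`.

```python
import random

P_FIELD, A = 17, 2          # toy curve y^2 = x^3 + 2x + 2 over F_17 (assumption)
ORDER = 19                  # number of points on this toy curve, #E(K)
G = (5, 1)                  # base point; None represents the point at infinity O

def add(p, q):
    """Affine point addition and doubling on the toy curve."""
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_FIELD == 0:
        return None                                   # p + (-p) = O
    if p == q:
        s = (3 * x1 * x1 + A) * pow(2 * y1 % P_FIELD, -1, P_FIELD) % P_FIELD
    else:
        s = (y2 - y1) * pow((x2 - x1) % P_FIELD, -1, P_FIELD) % P_FIELD
    x3 = (s * s - x1 - x2) % P_FIELD
    return (x3, (s * (x1 - x3) - y1) % P_FIELD)

def mul(k, p):
    """Plain bottom-up binary algorithm for k * p."""
    result, q = None, p
    while k:
        if k & 1:
            result = add(result, q)
        q = add(q, q)
        k >>= 1
    return result

def blinded_mul(k, p, rng):
    """Compute k * p via k' = k + r * #E(K) with a fresh random r."""
    r = rng.randrange(1, 1 << 20)
    return mul(k + r * ORDER, p)

rng = random.Random(6)
print(mul(7, G) == blinded_mul(7, G, rng))
```

Each call uses a different k', so the sequence of additions and doublings changes from run to run while the result stays the same; the cost is the longer scalar and the stored group order, as noted in the list above.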
3 Speeding Up the Binary Algorithm

As pointed out in section 2, an elliptic curve cryptosystem needs to efficiently
compute the scalar multiple of an elliptic curve point. The simplest efficient
method, the binary algorithm, is also the oldest. A good survey of (more recent)
methods is [10]. Most of these methods try to give an answer to the question of
how to find the shortest addition chain. An addition chain for an integer k is
a list of positive integers $a_1 = 1, a_2, \ldots, a_l = k$, such that for each $i > 1$, there
is some j and m, with $1 \le j \le m < i$ and $a_i = a_j + a_m$. Thus, if one has an
addition chain of length l, one can compute $k \cdot P$ with l additions. Finding
the best addition chain is impractical, but there are several methods for finding
near-optimal ones. With elliptic curves one has the possibility to use
addition-subtraction chains, because the computation of the inverse of a point has no cost.
Morain and Olivos discuss in [23] two algorithms that use addition-subtraction
chains. We describe their approach in the two subsequent sections.



3.1 First Algorithm 

The idea comes from the observation that long chains of 1's in the binary
representation of k are better treated by a subtraction. For instance if one calculates
15 * P like

15 * P = 16 * P - P = 2(2(2(2P))) - P,

one has to perform fewer operations than in the standard binary algorithm. So the
enhancement is to replace a block of at least two 1's in the binary representation
of k by a block of 0's and a -1 (written $\bar{1}$): $1^a \mapsto 1\,0^{a-1}\,\bar{1}$.
Automaton 1 in Figure 1 represents this idea. Morain and Olivos state that the
expected gain of this version is about 8.33%.
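The block replacement can be sketched as a recoding of k into digits from {-1, 0, 1}. The routine below is an illustration written for this text, not the automaton of [23]: the name `recode_blocks` is an assumption, digits are stored least significant first, and each block of 1's is treated independently as in the first algorithm.

```python
def recode_blocks(k):
    """Replace every block of at least two 1's by 1 0...0 -1 (digits LSB first)."""
    bits = [int(b) for b in bin(k)[2:][::-1]]
    digits, i, n = [], 0, len(bits)
    while i < n:
        if bits[i] == 0:
            digits.append(0)
            i += 1
            continue
        j = i
        while j < n and bits[j] == 1:    # find the end of the block of 1's
            j += 1
        run = j - i
        if run >= 2:
            digits.append(-1)            # the -1 at the bottom of the block
            digits.extend([0] * (run - 1))
            digits.append(1)             # the 1 lands on the 0 above the block
            i = j + 1
        else:
            digits.append(1)             # isolated 1 is kept as is
            i = j
    return digits

def value(digits):
    """Integer represented by a signed-digit string (LSB first)."""
    return sum(d << i for i, d in enumerate(digits))

def weight(digits):
    """Number of nonzero digits, i.e. additions or subtractions needed."""
    return sum(1 for d in digits if d != 0)

d15 = recode_blocks(15)
print(d15, value(d15), weight(d15))
```

For k = 15 the recoding needs two nonzero digits instead of four, matching the 16 * P - P computation above.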



3.2 Second Algorithm 

The idea is to treat isolated 0's inside a block of 1's. Using the map of the first
algorithm (with $\bar{1}$ denoting the digit -1), we get

$1^a\,0\,1^b \mapsto 1\,0^{a-1}\,\bar{1}\,1\,0^{b-1}\,\bar{1}.$

Since $-2 + 1 = -1$ we can write $\bar{1}\,1$ as $0\,\bar{1}$ and therefore

$1^a\,0\,1^b \mapsto 1\,0^{a}\,\bar{1}\,0^{b-1}\,\bar{1}.$

In automaton 2, the state 110 takes this modification into account. In both
figures the input path is marked by a distinct arrow, and the output paths
are marked by an additional bar. Intermediate states are drawn as circles, and
transitions between these intermediate states are represented as arrows. The
initial conditions for the automata are P = 0 and Q = M. An iterative
version of this algorithm can be found in [23]. The expected gain for this variant
is about 11.11%.
Fig. 1. Automaton 1

Fig. 2. Automaton 2



4 The New Countermeasure 

In contrast to the previously described countermeasures, which were introduced
by Coron in [6], we intend to randomize the binary algorithm itself. This can
be easily done by inserting a random decision in the two algorithms in sections
3.1 and 3.2. For example, if we are in state 1 we draw a random variable e. If
e = 0 we take the path of algorithm 1, else we proceed as in the standard binary
algorithm. The finite automaton in figure 3 shows the randomized algorithm
according to this idea.




Fig. 3. Randomized Automaton 1



In order to make SPA attacks more difficult, we changed all multipliers so that
each step needs only one double, or one double and one add (or subtract).
Note that the bits of the binary representation no longer correspond directly to
doubles (resp. adds)! For both 1 and 0 in the binary representation,
there is one path in the algorithm where, for instance, the double operation is
performed. In the same manner we modified finite automaton 2 (see figure 4).
In both figures, the paths that are randomized are drawn dash-dotted for better 
visibility. Again, in both figures the input path is marked by a distinct arrow, 
and the output paths are marked by an additional bar. Intermediate states are 
represented by circles, and transitions are represented as arrows between them. 
The initial conditions are again P = 0 and Q = M. For the sake of completeness 
we give an iterative algorithm implementing figure 4. 



Fig. 4. Randomized Automaton 2

randomized_version_automaton2(k, M) {
    state = 0; P = 0; Q = M;
    while (k > 0) {
        if ((k & 1) == 0) {
            if (state == 11) P = P + Q;
            state = 0; Q = 2*Q;
        }
        else {
            switch (state) {        /* rand() returns one random bit */
            case 0:  P = P + Q; Q = 2*Q; state = 1;
                     break;
            case 1:  e = rand();
                     if (e == 1) P = P + Q;
                     else { P = P - Q; state = 11; }
                     Q = 2*Q;
                     break;
            case 11: e = rand(); Q = 2*Q;
                     if (e == 0) { P = P + Q; state = 1; }
                     break;
            }
        }
        k >>= 1;
    }
    if (state == 11) P = P + Q;
    return P;
}

Fig. 5. Iterative Version of the Randomized Automaton 2

4.1 Analysis of the Randomized Algorithms

As noted before, the binary representation does not correspond directly to the
performed operations anymore. A second observation is that, because
-P(x,y) = P(x,-y) (or -P(x,y) = P(x,x + y) in characteristic 2), the add and
subtract operations are basically the same and are not distinguishable in the power
trace. The difference between add (subtract) and double depends on the underlying
field and the coordinates used. For instance, as listed in the table in section
2.1, in $GF(2^n)$ with affine coordinates, both the add and the double operation need
1 inversion, 2 multiplications and 1 squaring. We now consider some
possible scenarios for a power attack:

SPA case: Suppose one has an implementation where the distinction between
double and add (subtract) is possible with a single measurement (this could
be the case when working with projective coordinates). It would be possible
to identify a block of 0's at the beginning of the algorithm. Also, blocks of 0's
result more likely in consecutive doubles than blocks of 1's. Basically, some
binary representations are more likely than others, and this could be used to
identify the correct key. But this is not as easy as mounting an SPA
attack on the standard binary algorithm.

DPA case: Let us assume now that we don't have such a naive implementation,
and therefore the difference between a double and an add (or a
subtract) operation is not visible with only one power measurement. Every
time the algorithm is performed, it takes a different path due to the
randomization. Therefore the sequence of doubles, adds, and subtracts is slightly
different. If the random numbers are close to uniform, this makes an attack
like the mean method infeasible. The correlation method does not
work anymore either. Because of the randomization, the intermediate values that
are attacked are computed at different times, or are sometimes not even
calculated. This washes out the DPA bias signal.

The performance of the randomized algorithms is close to the standard binary 
algorithm. The following table shows the percentage of additional operations 
(additions and subtractions) in comparison with the ordinary binary algorithm 
for various key lengths (which were chosen according to the commonly suggested 
field sizes). For this table we counted the number of additions and subtractions 
(the number of doublings is for both variants approximately the same) for several 
thousands of executions (for several values of k) of the algorithm given in figure 5. 
The table shows that the additional number of operations is almost independent 
of the keylength and is approximately 9%. 
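A counting experiment of this kind can be sketched in Python. The code below follows the iterative version of the randomized automaton 2, applied to integers so that the result can be checked against k * M. The counting convention (every P = P + Q or P = P - Q counts once, the baseline is the number of 1 bits of k) and the sample sizes are assumptions, so the printed percentage need not coincide with the figures in the table.

```python
import random

def randomized_automaton2(k, M, rng):
    """Return (k * M, number of additions and subtractions performed)."""
    state, P, Q, ops = 0, 0, M, 0
    while k > 0:
        if (k & 1) == 0:
            if state == 11:              # resolve the pending carry
                P += Q
                ops += 1
            state = 0
            Q *= 2
        elif state == 0:
            P += Q
            ops += 1
            Q *= 2
            state = 1
        elif state == 1:
            if rng.getrandbits(1):       # e = 1: standard binary step
                P += Q
            else:                        # e = 0: subtract, carry is now pending
                P -= Q
                state = 11
            ops += 1
            Q *= 2
        else:                            # state == 11
            Q *= 2
            if rng.getrandbits(1) == 0:  # e = 0: resolve the carry right away
                P += Q
                ops += 1
                state = 1
        k >>= 1
    if state == 11:
        P += Q
        ops += 1
    return P, ops

def overhead(bits, runs, rng):
    """Relative number of extra additions compared with the binary algorithm."""
    randomized = baseline = 0
    for _ in range(runs):
        k = rng.getrandbits(bits) | (1 << (bits - 1))
        value, ops = randomized_automaton2(k, 1, rng)
        assert value == k                # the scalar multiple must be correct
        randomized += ops
        baseline += bin(k).count("1")
    return randomized / baseline - 1

print(f"{100 * overhead(160, 2000, random.Random(5)):.1f}% additional operations")
```

Whatever path the random bits select, the returned value equals k * M; only the number and order of additions and subtractions vary between runs.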

What can we say about the storage of additional parameters? The speedup
in the article of Morain and Olivos uses the bottom-up variant of the binary
algorithm, which requires the additional storage of one elliptic curve point. This
is obviously a disadvantage. On the other hand, the proposed standard IEEE
P1363a includes this version of the binary algorithm. The only other additional
parameter which we have to store is the random bit e.

Table 1. Performance comparison 

Bitlength   Additional operations (mean value)   Variance 
112         9%                                   3% 
128         10%                                  5% 
160         9%                                   3% 
192         8%                                   3% 
224         8%                                   3% 
236         7%                                   3% 
384         9%                                   2% 
521         9%                                   2% 

5 Conclusion 

We described an alternative approach for the development of countermeasures 
against power attacks. Our countermeasure does not depend on any input 
parameters of the binary algorithm, but on the algorithm itself. We have analyzed 
its effectiveness in preventing the mean and the correlation methods, and its 
performance. In fact, our method largely prevents recent power analysis 
attacks without the need to store large additional parameters or do 
any precomputations. One can also combine this countermeasure with the other 
suggested countermeasures to achieve a higher security level. 



Abstract. This paper discusses the architectural optimizations for a special 
purpose ASIC processor that implements the AES Rijndael Algorithm. In 
October 2000 the NIST chose Rijndael as the new Advanced Encryption 
Standard (AES). The algorithm supports key lengths and block lengths 
of 128, 192, or 256 bits. VLSI architectural optimizations such as 
parallelism and distributed memory are discussed, and several hardware design 
techniques are employed to increase performance and reduce area consumption. 
The hardware architecture is described using Verilog XL and synthesized by 
Synopsys with a 0.18 µm standard cell library. Results show that with a design 
of 173,000 gates, data encryption can be done at a rate of 1.82 Gbits/sec. 



1 Introduction 

Although many encryption algorithms can be relatively efficiently implemented in 
software on general-purpose or embedded processors, there is still a need for special 
purpose cryptographic processors. 

First of all, high throughput applications, such as the encryption of the physical 
layer of Internet traffic, require an ASIC that does not affect the data throughput. For 
example, a software implementation of the Rijndael algorithm on a 200 MHz Pentium Pro 
yields a throughput of around 100 Mbits/sec [1], which is too slow for high-end 
Internet routers. 

Moreover, in mobile applications such as cellular phones and PDAs, 
software implementations on general-purpose processors consume much more power 
than special purpose ASICs do. Last of all, applications often 
require that the encryption logic be physically isolated from the rest of the system, so that 
the encryption can be secured more easily. In this case a hardware accelerator is a 
more suitable solution as well. 

The AES Rijndael algorithm was chosen in October 2000 and is expected to 
replace the DES and Triple DES because of its enhanced security levels [2]. In this 
paper, VLSI optimizations of the Rijndael algorithm are discussed and several 
hardware design modifications and techniques are used, such as memory sharing and 
parallelism. The circuits are synthesized using a 0.18 µm CMOS standard cell library, 
and timing and gate counts are estimated. 

In the following sections, we will briefly discuss the algorithm flow, followed by 
detailed hardware implementations and techniques. After that we will present the 
simulation results, followed by future developments and conclusions. 



2 Rijndael: Algorithm Flows 

The main flow of the algorithm, as shown in Fig. 1, uses many lookup tables and 
XOR operations. The algorithm accepts blocks of size 128, 192, or 256 bits. 
Independently, the key length can be 128, 192, or 256 bits as well. Every encryption 
is done in a fixed number of rounds (10, 12, or 14), which depends on the 
block length and the key length chosen. An encryption 
module is used to generate all the intermediate encryption data, and a separate 
key-scheduling module is used to generate all the round sub-keys from the initial key. 
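
The dependence of the round count on the block and key lengths can be sketched as follows. This is only an illustration: it assumes the rule from the Rijndael specification that the number of rounds is max(Nb, Nk) + 6, where Nb and Nk are the block and key lengths in 32-bit words. The function name num_rounds is hypothetical and does not come from the paper.

```python
def num_rounds(block_bits: int, key_bits: int) -> int:
    """Number of Rijndael rounds for a given block and key length.

    Assumption: the Rijndael specification rule Nr = max(Nb, Nk) + 6,
    with Nb and Nk counted in 32-bit words.
    """
    allowed = (128, 192, 256)
    if block_bits not in allowed or key_bits not in allowed:
        raise ValueError("block and key lengths must be 128, 192, or 256 bits")
    nb = block_bits // 32  # block length in 32-bit words
    nk = key_bits // 32    # key length in 32-bit words
    return max(nb, nk) + 6


if __name__ == "__main__":
    # Print the round count for all nine block/key combinations.
    for block in (128, 192, 256):
        for key in (128, 192, 256):
            print(f"block={block} key={key} rounds={num_rounds(block, key)}")
```

Under this rule, the larger of the two lengths alone determines the round count, which is why a 256-bit block with a 128-bit key still needs 14 rounds.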

The encryption round can be divided into four blocks: Key Addition, Shift Row, Mix 
Column, and Substitution. The Key Addition module is a byte-wise XOR between the round 
key and the encryption data. The Shift Row and the Substitution modules involve 
mainly table lookups. Last of all, the Mix Column module consists of XOR 
operations. The algorithm flow is shown in Fig. 1. The Key Scheduling module is 
totally independent of the encryption module, and it also involves table lookups and 
XOR operations. 



Fig. 1. Algorithm Flows. 



There are a total of three sets of tables used by key scheduling and encryption. 
One of them has 256 bytes of entries, one has 30 bytes, and the remaining one has 24 
bytes. 



3 Architecture Optimizations 

The initial specification of the Rijndael algorithm was implemented mainly in 
software. Although the algorithm is designed with hardware implementation in mind, 
the transition from software to hardware involves modifications. The main challenge 
in the hardware implementation is to maximize the encryption throughput while 
minimizing the area consumption at the same time. Maximizing the throughput requires 
minimizing the critical paths and resolving the memory access conflicts. As shown in Fig. 2 
[3], there are a lot of regularities in the design of the Rijndael algorithm. Therefore, with 
careful VLSI design, the critical path as well as the overall area can be minimized. 



Fig. 2. 2D diagram illustrating data flow (Substitution, ShiftRow, MixColumn, KeyAdd), adapted from [3]. 



3.1 Basic Architecture Decisions 

In our implementation of the algorithm, there is only hardware for one encryption 
round and we re-use the same piece of hardware to complete the whole encryption 
process. While this implementation also helps conserve area, the main reason for 
this design is to support the different feedback modes that are currently 
used in the industry. Although NIST is currently introducing a new counter 
mode of operation, the common modes of operation used today do not allow 
pipelining of encryption modules. Therefore having two or more encryption modules 
in the processor is not the most flexible design. 

Besides having hardware for one encryption round, we also designed the processor 
to complete one encryption round in one clock cycle. This is very important, 
for example, in high throughput systems, because it ensures that the design runs at 
the lowest possible clock frequency for a given throughput. The drawback of this 
design is that we have to duplicate some of the modules, especially lookup tables, in 
order to finish all the required operations of one encryption round in one clock cycle. 

The third basic architecture decision we made concerned key scheduling. There are 
two ways of generating the round keys for encryption: either generating all the 
sub-keys beforehand and storing them in a buffer, or generating all the sub-keys on 
the fly in parallel with the encryption module. 

Since buffer storage could take up a substantial amount of space, we decided to 
generate the sub-keys on the fly during encryption. Therefore we implemented the 
hardware required to generate one sub-key and re-use it for calculating all the 
sub-keys, and at the same time also use one clock cycle for each sub-key generation. 



3.2 System Setup 

The general block diagram is shown in Fig. 3. Besides the Encryption and Key 
Scheduling modules, there are one controller for the input channel, one controller for 
the output channel, and a top-level controller interfacing with the user modules. 
There is only one system clock, and it is fed to all the modules. 



Fig. 3. Overall block diagram. 

Both the input and output channels are 16 bits wide. Therefore, in order to read in 
the whole data block or key, a handshaking protocol is used. The top-level controller 
takes in a 4-bit instruction and returns a ready signal when it is idle. In order to allow 
128, 192, and 256 bits for both Encryption and Key Scheduling, the internal datapaths 
are all 256 bits wide. The user has the ability to set the block length and the key length 
using specific instructions, and the input and output controllers will automatically 
adjust the input and output sequences. 

Specifically, pipelining and unrolling are not implemented in the system. As a 
result, there is only one module for Encryption and one module for Key Scheduling, 
and these modules are reused to generate all the intermediate data and key. This 
design should be the most area efficient with the best module utilization. 

As shown in Table 1, the instructions are four bits long. The feedback-mode instructions (1110 and 
0110) take in the raw data and encrypt it one thousand times using OFB 
feedback mode; they are used for measuring the maximum operating frequency of 
the core during tests. Decryption is not implemented in the current design since it 
requires a separate datapath. To implement decryption, either 
all of the round sub-keys have to be generated beforehand, or there needs 
to be another datapath implementing the inverse process of Key Scheduling. The first 
option requires an additional 3584 bits of register storage while the second requires 
more routing; both result in a much larger area. 

Table 1. Instruction set used for this processor. 

Instruction                                   Opcode 
Reset                                         0000 
Set Block Length, 128 bits                    1010 
Set Block Length, 192 bits                    1011 
Set Block Length, 256 bits                    1100 
Set Key Length, 128 bits                      0010 
Set Key Length, 192 bits                      0011 
Set Key Length, 256 bits                      0100 
Input Data                                    1001 
Input Key                                     0001 
Encrypt                                       1101 
Encryption - Feedback Mode (for testing)      1110 
Decryption                                    0101 
Decryption - Feedback Mode (for testing)      0110 
Output Data                                   0111 



3.3 Memory Architecture Optimization 

Since the design is based on one clock cycle for each encryption round, we have to 
duplicate memory modules several times. Consequently, the choice of memory 
architecture is critical. Since all the table entries are fixed and defined in the 
standard, Random Access Memory (RAM) is not needed; Read Only 
Memory (ROM) is enough. Specifically, the algorithm requires many small 
ROM modules instead of one large memory module, since each lookup is 
based on an address of at most 8 bits, which translates to 256 entries. However, the 
ROM has to be asynchronous; otherwise several clock cycles would be required for 
all the memory reads. In our design, combinational logic is used to implement the 
table lookups. 

There are three types of tables used in our design. The first one, which is the 
most used, is the S-box. It is a 256-entry table with 8-bit entries. Using a 
combinational network, we implemented the table with around 2200 gates, 
which corresponds to around 51000 µm². The access time for the table is around 1.89ns. 
We have a total of 48 copies of the table in our design: 32 of them in the Encryption 
module and 16 in the Key Scheduling module. 

The second table lookup decides the shift amount in the Shift Row module 
and has 24 entries. We implemented four copies of this table in our design, 
using 55 gates with an area of 1000 µm². The last type of 
table lookup has 30 entries, and it is used to generate the round constant in the 
key-scheduling module. It is only accessed once in each round, so we have only one copy 
of the table, with 70 gates occupying 1300 µm². 



3.4 Simplification of Modulus Operation 

There are several modulus operations in the algorithm: modulo 4, 6, and 8. Since the 
modulus values are known in advance, generic modulo operations are unnecessary, and 
they would require a lot of area. Therefore, it is beneficial to look into the data set and break 
down the modulus operations into more efficient combinational logic, which 
consumes less area. 

The modulus 4 and 8 operations are relatively easy to implement by 
selecting bits of the operand. The result of modulus 4 is the last 2 bits of the operand, and the result of 
modulus 8 is the last 3 bits of the operand. In simplifying modulus 6, it is necessary 
to look at the set of values the operand takes, since there is no simple method for 
reducing it. In the algorithm, the operand of modulus 6 takes on values from 0 to 13; therefore a 
Karnaugh Map was used to implement the operation efficiently using gates. 
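
The three reductions can be sketched in software as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the modulus 6 case is modelled as a 14-entry lookup table over the operand range 0 to 13 given in the text, standing in for the gate-level logic derived from the Karnaugh Map, which the paper does not list.

```python
def mod4(x: int) -> int:
    """Modulo 4 keeps only the two least significant bits."""
    return x & 0b11


def mod8(x: int) -> int:
    """Modulo 8 keeps only the three least significant bits."""
    return x & 0b111


# Modulo 6 has no bit-selection shortcut. Because the operand only takes
# values 0..13 (as stated in the text), a 14-entry table is sufficient.
# In hardware this table would be reduced to a few gates.
MOD6_TABLE = tuple(value % 6 for value in range(14))


def mod6(x: int) -> int:
    """Modulo 6 for operands in the restricted range 0..13."""
    if not 0 <= x <= 13:
        raise ValueError("operand outside the range 0..13 used by the algorithm")
    return MOD6_TABLE[x]
```

Restricting the operand range is what makes the modulus 6 logic small: only 14 input patterns need to be covered, and the two remaining 4-bit patterns are don't-care conditions.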



3.5 Encryption Datapath 

As discussed before, the encryption module can be broken down into four different 
sub-modules, and the same applies to the hardware implementation of the 
algorithm. We implemented the four modules (Substitution, Shift Row, Mix Column, 
and Key Addition) using mainly lookup tables, XORs, and pure combinational logic. 
Moreover, the datapath is 256 bits wide regardless of the actual block length. 

3.5.1 Substitution 

The 256 bit data is broken down into 32 chunks of 8 bits each, and each of them is used 
as the address for an S-box table lookup. The S-box contains 256 entries, and each entry 
is 8 bits wide. The S-box is implemented using combinational logic with an access 
time of around 1.89ns. In order to achieve parallelism and finish one round of 
encryption in one clock cycle, the same S-box is duplicated 32 times. Fig. 4 shows 
the block diagram for this module. 
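
A software model of this module is sketched below. It is an illustration only: the S-box contents are generated from the construction in the Rijndael specification (multiplicative inverse in GF(2^8) followed by an affine transformation with constant 0x63), not from the paper's combinational network, and all function names are hypothetical. The 32 independent lookups correspond to the 32 duplicated S-boxes.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce back to 8 bits
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse by exhaustive search; 0 maps to 0 by convention."""
    if a == 0:
        return 0
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox() -> list:
    """Build the 256-entry S-box: inversion followed by the affine map."""
    sbox = []
    for value in range(256):
        inv = gf_inv(value)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def substitute(state: bytes) -> bytes:
    """Apply the S-box to each of the 32 bytes of the 256-bit datapath.

    Each lookup is independent, so hardware can perform all 32 at once.
    """
    if len(state) != 32:
        raise ValueError("the datapath is 256 bits (32 bytes) wide")
    return bytes(SBOX[b] for b in state)
```

Because no lookup depends on another, duplicating the table trades area for the ability to finish the whole substitution within one clock cycle.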



Fig. 4. Block diagrams for Substitution. 



3.5.2 Shift Row 

Inside Shift Row, the 256 bit data is broken down into four chunks. Each of the 64-bit 
chunks is called a “row” and it contains eight bytes. Byte-wise cyclic shifts are 
performed on each “row” (Fig. 5), and the amount of shift is determined by the block 
length through a simple table lookup (24 entries). Modulus 4, 6, and 8 operations 
determine the boundaries at which wrap-around happens. 
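
The row shift with a block-length-dependent wrap-around can be sketched as follows. The shift offsets are taken from the Rijndael specification, not from the paper's 24-entry table, and the handling of the unused bytes in the 8-byte “row” (passed through unchanged) is an assumption made for this illustration. The names SHIFT_OFFSETS and shift_row are hypothetical.

```python
# Left-shift offsets per row, indexed by Nb (block length in 32-bit words).
# Values follow the Rijndael specification; row 0 is never shifted.
SHIFT_OFFSETS = {
    4: (0, 1, 2, 3),  # 128-bit block
    6: (0, 1, 2, 3),  # 192-bit block
    8: (0, 1, 3, 4),  # 256-bit block
}


def shift_row(row: list, row_index: int, nb: int) -> list:
    """Cyclically shift one 8-byte "row" to the left.

    Only the first nb bytes are active; the index wraps modulo nb
    (that is, modulo 4, 6, or 8). Bytes beyond nb are left untouched,
    which is an assumption of this sketch.
    """
    if len(row) != 8:
        raise ValueError("each row slot holds eight bytes")
    offset = SHIFT_OFFSETS[nb][row_index]
    active = row[:nb]
    shifted = [active[(col + offset) % nb] for col in range(nb)]
    return shifted + list(row[nb:])
```

For example, with a 128-bit block only four bytes of each 64-bit “row” take part in the rotation, so the wrap-around boundary is given by the modulus 4 operation.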




Fig. 5. Block diagram for Shift Row (only one of the four 8-byte “rows” is shown). 



3.5.3 Mix Column 

In Mix Column, the four bytes in corresponding positions in the four “rows” are used 
for matrix multiplication in GF(2^8), which involves byte-wise multiplication and 
addition. Byte-wise additions are easily done by XOR, and several tricks are used for 
multiplications. 

Byte-wise multiplications include multiplying the data by 1, 2, and 3. When multiplying 
by 1, the data remains the same. For multiplication by 2, the 8 bit data is left shifted 
by 1 bit, and the LSB is replaced by 0. Then the MSB of the original data is 
examined. If it is 0, then the left shifted data is the result; if it is 1, then the left 
shifted value is XORed with the reduction polynomial, in this case 00011011, to 
generate the result. For multiplication by 3 we simply XOR the original byte with the 
result of multiplication by 2. 

Using the above method, the multiplications by 1, 2, and 3 of each of the bytes in 
the data are determined. Then the correct combinations of values are XORed with 
each other to produce a new byte. The same process goes on until all the 32 bytes in 
the data are replaced. 
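
The multiplications by 2 and 3 and their combination into one output column can be sketched as follows. The shift-and-conditional-XOR step follows the description above; the coefficient matrix (rows 2 3 1 1, 1 2 3 1, 1 1 2 3, 3 1 1 2) is taken from the Rijndael specification, since the text does not spell it out. Function names are hypothetical.

```python
def mul2(byte: int) -> int:
    """Multiply by 2 in GF(2^8): shift left, then reduce if the MSB was set."""
    shifted = (byte << 1) & 0xFF          # left shift, LSB becomes 0
    if byte & 0x80:                       # MSB of the original data
        return shifted ^ 0b00011011       # XOR with the reduction polynomial
    return shifted


def mul3(byte: int) -> int:
    """Multiply by 3: XOR the original byte with its multiple of 2."""
    return mul2(byte) ^ byte


def mix_column(column: list) -> list:
    """Mix one column of four bytes (one byte from each of the four rows).

    Assumption: the coefficient matrix of the Rijndael specification.
    """
    a0, a1, a2, a3 = column
    return [
        mul2(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ mul2(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ mul2(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ mul2(a3),
    ]
```

In hardware each of these expressions is a small XOR network, so all 32 output bytes can be produced in parallel without any lookup table.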

Fig. 6 shows the block diagram for generating the first byte of each row. 

3.5.4 Key Addition 

In Key Addition, the 256 bit data is XORed with the 256 bit keys to generate the 
result, as shown in Fig. 7. 



3.6 Key Scheduling Datapath 

The datapath for Key Scheduling is also 256 bits wide to accommodate different 
key lengths. Moreover, the sub-keys are all generated on the fly, meaning that there 
is no buffer storage for key generation. 

Fig. 6. Block diagram for Mix Column (only byte 0 calculation is shown). 

Fig. 7. Block diagram for Key Addition. 



3.6.1 Datapath breakdown 

The datapath can be broken down into three parts. In the first part, the 256-bit key is 
separated into four 64-bit “rows,” and the lowest byte of each “row” is used as the 
address to access the S-box. The returned 8-bit result is XORed with the original byte to 
produce the new byte. For parallel access the S-box is duplicated four times. 

The second part involves an XOR of the zeroth byte with the round constant. A 
pointer, which increments every clock cycle, is used as an address to access the 
30-entry round constant table. 
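
The contents of such a round constant table can be sketched as follows. This assumes that the 30 entries are the successive powers of x in GF(2^8), as in the Rijndael reference code; the paper itself does not list the table, and the names xtime and round_constants are hypothetical.

```python
def xtime(byte: int) -> int:
    """Multiply by x (that is, by 2) in GF(2^8) modulo 0x11B."""
    shifted = (byte << 1) & 0xFF
    return shifted ^ 0x1B if byte & 0x80 else shifted


def round_constants(count: int = 30) -> list:
    """Generate round constants 1, x, x^2, ... in GF(2^8).

    Assumption: the table holds successive powers of x, starting at 1.
    """
    table = []
    value = 0x01
    for _ in range(count):
        table.append(value)
        value = xtime(value)
    return table


if __name__ == "__main__":
    # A pointer incremented once per clock cycle would index this table.
    print([f"{c:02x}" for c in round_constants()])
```

Because each entry is fixed, the table can be held in a small ROM or collapsed into combinational logic addressed by the round pointer.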

In the third part, the 256-bit data is again broken down into four “rows” of 64 bits 
each. Each “row” contains eight bytes, and each byte is XORed with the previous 
byte in a sequential manner. The block diagram is shown in Fig. 8. Since the 
datapath is slightly different for a Key Length of 256 bits, a MUX is used for the 
selection of the fourth byte and is controlled by the key length. 




Fig. 8. Block diagram for Key Scheduling (only one of the four 8-byte “rows” is shown). 

3.6.2 Key Alignment 

Since the Rijndael algorithm allows different key lengths and block lengths, each 
sub-key must have the same length as the data block. According to the specification of 
the algorithm, the original key is used to generate the entire sub-key 
stream, and chunks of sub-keys are selected for the encryption module according to 
the block length. This approach works if there is buffer storage in the design to 
store the whole sub-key sequence, but it is not applicable to our implementation. 

In the cases of 128-128 (block-key), 192-192, and 256-256, the generated sub-keys 
can be fed into the encryption module directly without any reorganization (Fig. 9a). 
However, in the case of 256-128, since the encryption and key-scheduling 
modules share the same clock, the key-scheduling module has to 
create two sets of 128-bit sub-keys to be combined into the 256-bit sub-key for the 
encryption module (Fig. 9b). 

On the other hand, in the case of 192-128, the original 128-bit key is used for the 
lower 128 bits of the sub-key fed to the encryption module. Then the 128-bit key 
goes through the key-scheduling module to generate the next 128-bit sub-key. 
The lower half of this key is used as the upper 64 bits of the first sub-key fed into the 
encryption module, and the upper half is used for the next sub-key (Fig. 9c). In this 
case we will sometimes need to access the next sub-key and sometimes the previous 
sub-keys. 

Fig. 9. Illustration of alignments of sub-keys: (a) 128-128, (b) 256-128, (c) 192-128. 
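
The alignment problem can be illustrated with a simplified model. The sketch assumes that the key-scheduling module emits one key-length-wide chunk per step (chunk 0 being the original key) and that the sub-key for round i is the i-th block-length-wide slice of the concatenated stream. The function subkey_sources is hypothetical; it only reports which chunks each sub-key overlaps.

```python
def subkey_sources(block_bits: int, key_bits: int, n_subkeys: int) -> list:
    """For each sub-key, list the key-schedule chunks it draws bits from.

    Simplified model: the sub-key stream is the concatenation of
    key_bits-wide chunks, and sub-key i covers bits
    [i * block_bits, (i + 1) * block_bits) of that stream.
    """
    sources = []
    for i in range(n_subkeys):
        start = i * block_bits
        end = start + block_bits            # exclusive
        first_chunk = start // key_bits
        last_chunk = (end - 1) // key_bits
        sources.append(list(range(first_chunk, last_chunk + 1)))
    return sources


if __name__ == "__main__":
    for block, key in ((128, 128), (256, 128), (192, 128)):
        print(f"{block}-{key}:", subkey_sources(block, key, 4))
```

In this model the 128-128 case needs one chunk per sub-key, the 256-128 case needs two fresh chunks per sub-key, and the 192-128 case has sub-keys that straddle chunk boundaries, so a chunk generated for one round is partly reused in the next.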



3.6.3 Key Scheduling Architecture 

By careful analysis of all nine combinations of Block Length and Key 
Length, we noticed that in the worst case the Key Scheduling module needs to 
maintain the previous, current, and also the next sub-keys in order to generate the 
appropriate set of keys that are fed into the encryption module. 

Therefore, we decided to implement two sets of the key-scheduling module to achieve 
this. Fig. 10 shows the block diagram of our design. An extra selection module is 
used to select from the three sub-keys, based on the key length, block length, and the 
round count, the correct combination of keys that should be fed to the encryption 
module. 




Fig. 10. Architecture of Key Scheduling used. 



4 Results 

The hardware design was done using Cadence Verilog-XL, and synthesis was 
done using Synopsys DesignCompiler and National Semiconductor’s 0.18 µm 
standard cell library. The synthesis was done using two libraries: the worst-case 
library, which uses 1.2V at 120F and worst case processing, and the typical-case 
library, which uses 1.8V at 60F with best processing parameters. Results are in Table 2. 



Table 2. Synthesis Results. 

                                                 Worst-case library   Typical-case library 
Critical Path                                    21 ns                10 ns 
Frequency                                        48 MHz               100 MHz 
Chip Area                                        4.23 mm²             3.96 mm² 
Gate Count                                       184,000              173,000 
Max. Throughput (256 bits data / 128 bits key)   870 Mbits/sec        1.82 Gbits/sec 
Min. Throughput (128 bits data / 256 bits key)   435 Mbits/sec        910 Mbits/sec 



The critical path lies in the Key-Scheduling module, and it is shown in Fig. 11. It 
involves going through an S-box lookup and XOR, then the round constant lookup and 
XOR, followed by a sequence of XORs and one more S-box lookup. This path is 
traversed one more time since we have two key-scheduling modules, and since one 
path is around 4.5ns, going through the two modules takes a total of 9ns. 
Together with the sub-key selection module, which is around 3ns, the whole critical 
path becomes 10ns. 



Fig. 11. Critical path for Key Scheduling. 

The critical path in the Encryption module is illustrated in Fig. 12. It involves an 
S-box lookup, then the shift row module, which includes table lookup and XOR, four 
sets of XOR in Mix Column, and a final XOR operation in key addition. The overall 
path is around 6ns. 



Fig. 12. Critical path breakup on Encryption module (Substitution, ShiftRow with table lookup, MixColumn, KeyAdd). 



Since the critical path is 10ns, the system can operate with a clock of 
100MHz in a typical environment. When calculating the throughput, we measure the 
critical path of the processor core (Encryption and Key Scheduling modules), 
calculate the time to finish one encryption, and determine the throughput. 

In the worst case, where the data block is 128 bits, the key is 256 bits, and the 
encryption requires 14 rounds, the throughput is 910 Mbits/sec. In the best case, 
where the data block is 256 bits and the encryption takes 14 rounds, the throughput is 1.82 
Gbits/sec. For comparison, a software implementation on a 200MHz Pentium Pro 
system running Linux has a best-case throughput of about 100 Mbits/sec. 
Compared to the software implementation, the hardware implementation is about 18 
times faster. 
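
The throughput figures can be reproduced approximately with a short calculation. This sketch assumes one round per clock cycle and a clock period equal to the critical path, and it ignores I/O and key setup time; the function name throughput_bits_per_sec is hypothetical.

```python
def throughput_bits_per_sec(block_bits: int, rounds: int,
                            critical_path_ns: float) -> float:
    """Throughput when one round completes per clock cycle.

    Assumes the clock period equals the critical path and that no
    cycles are spent on I/O or key setup.
    """
    encryption_time_s = rounds * critical_path_ns * 1e-9
    return block_bits / encryption_time_s


if __name__ == "__main__":
    for label, path_ns in (("typical", 10.0), ("worst-case", 21.0)):
        best = throughput_bits_per_sec(256, 14, path_ns)
        worst = throughput_bits_per_sec(128, 14, path_ns)
        print(f"{label}: max {best / 1e6:.0f} Mbits/sec, "
              f"min {worst / 1e6:.0f} Mbits/sec")
```

With a 10ns critical path this gives roughly 1.83 Gbits/sec and 914 Mbits/sec, in line with the rounded values quoted in the text.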

The whole chip has a size of around 3.96mm², with a gate count of around 173,000 
gates. The input and output controllers each take 1.6% of the overall area, and the 
top-level controller takes around 3.9% of the overall area. The Key Scheduling 
module consumes about 35% of the area, and the remaining Encryption module 
occupies 57.5% of the overall area. All these data are summarized in Table 3. 

Each 256-byte table consumes about 51000 µm². Together with the four tables 
for Shift Row and the one for the round constant, which are 
very small, all the lookup tables in the system add up to 2.5mm², around 63% of the overall area. 
In terms of register storage, the current design requires a total of 13200 µm² for 
registers, which is about 8% of the overall area. Therefore, all memory components, 
including registers and table lookups, occupy around 71% of the area of the chip. 



Table 3. Comparison between Encryption and Key Scheduling modules. 

                          Encryption   Key Scheduling 
Area                      2.28 mm²     1.39 mm² 
Gate Count                99,300       60,100 
Percentage of Chip Size   57.5%        35% 
Critical Path             6 ns         10 ns 

Table 4 compares the design described in this paper with the design by the National 
Security Agency (NSA) [6]. The research conducted by the NSA was primarily used as a 
reference for NIST; therefore it did not include special architecture techniques, in 
order to give fair results across all AES candidates. Also, note that the library used 
was a 0.5 µm library. 



Table 4. Comparison with results from NSA. 

                  Design from NSA (0.5 µm)   Design in this paper (0.18 µm) 
Chip Area         46 mm²                     3.96 mm² 
Gate Count        1,000,000                  173,000 
Max. Throughput   447 Mbits/sec              1.82 Gbits/sec 
Min. Throughput   320 Mbits/sec              910 Mbits/sec 



From our results, we noticed that the generation of sub-keys on the fly creates a 
serious bottleneck for the system. Since the encryption module has a critical path of 
around 6ns and the key scheduling module has a critical path of 10ns, the encryption 
module is idle for almost 4ns. If we could reduce the path inside the key scheduling 
module to around 6ns, the throughput would be maximized. 

This implementation is entirely possible. As we have discussed, one 
key-scheduling module has a critical path of around 4.5ns; therefore, if we implement 
some buffer storage for sub-key generation, so that we only need to maintain one 
key-scheduling module, the critical path inside key scheduling would drop substantially from 
10ns to at most 5ns, which matches well with the encryption module. 

The tradeoffs with this implementation would be the extra area for buffer 
storage and also the time required to generate all the sub-keys before encryption can 
start. By analyzing the current Key Scheduling module, each of the two sub-key 
generation parts consumes 0.53mm² and the sub-key selection module consumes 
0.33mm², whereas 3584 bits of register storage takes up around 0.5mm². Therefore if 
we generate all sub-keys ahead of time, we can save the sub-key selection module and 
one sub-key generation module, replace that by 3584 bits of register storage and 
actually save around 0.35mm² of chip size. 

On the other hand, although the critical path could be reduced from 10ns to 6ns, 
the new design would require time to initialize all the keys. In the worst case, where 
block size is 256 bits and key size is 128 bits, it would require 28 cycles to generate 
all the required sub-keys for encryption. Compared to the 14 cycles required for 
actual encryption, the overhead could be as much as 200%. 
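
The area and setup-time figures of this tradeoff can be checked with a short calculation. The areas are the values quoted in the text. The reading that the 3584 stored bits correspond to 14 sub-keys of 256 bits, produced 128 bits per cycle, is an assumption of this sketch and is not stated explicitly in the paper; the function names are hypothetical.

```python
# Areas quoted in the text, in mm^2.
SUBKEY_GENERATION_PART_MM2 = 0.53   # one of the two sub-key generation parts
SUBKEY_SELECTION_MM2 = 0.33         # sub-key selection module
REGISTER_STORAGE_MM2 = 0.5          # 3584 bits of register storage


def area_saved_mm2() -> float:
    """Area saved by dropping one generation part and the selection
    module, and adding register storage instead."""
    return (SUBKEY_GENERATION_PART_MM2 + SUBKEY_SELECTION_MM2
            - REGISTER_STORAGE_MM2)


def key_setup(block_bits: int, key_bits: int, rounds: int) -> tuple:
    """Return (stored bits, setup cycles, setup overhead relative to
    the cycles of one encryption).

    Assumption: one block-length sub-key is stored per round, and the
    key schedule produces key_bits new bits per clock cycle.
    """
    stored_bits = rounds * block_bits
    setup_cycles = stored_bits // key_bits
    overhead = setup_cycles / rounds
    return stored_bits, setup_cycles, overhead


if __name__ == "__main__":
    print(f"area saved: {area_saved_mm2():.2f} mm^2")
    print("256-bit block, 128-bit key:", key_setup(256, 128, 14))
```

Under these assumptions the saving comes to about 0.36mm², close to the figure of around 0.35mm² quoted above, and the 28 setup cycles are twice the 14 cycles of one encryption, which is the 200% overhead.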



5 Conclusion 

In this paper, a hardware implementation of the AES Rijndael algorithm is described. 
In order to better fit the algorithm for hardware implementation, several modifications 
are introduced, including memory access, modulo reduction, and key scheduling on 
the fly. Synthesized using a 0.18 µm library, the gate count is estimated to be around 
173,000. It can sustain a maximum throughput of around 1.82 Gbits/sec at a clock 
frequency of 100MHz, which is substantially faster than the software implementation. 
Moreover, the area tradeoff for memory sharing and the addition of decryption are also 
discussed. 

For future development, the real time required for key initialization 
and for a whole encryption should be measured on the real chip. Moreover, a more 
detailed estimate should be made of the actual area increase for the addition of 
decryption. Power consumption analysis is essential as well for mobile applications, 
and the actual resistance towards timing and power attacks will be 
investigated [7]. Last of all, the use of buffer storage and sub-key pre-calculation 
should be analyzed as well. 

Acknowledgements: UC Micro #00-097, Atmel Corporation, Panasonic, and 
National Semiconductor Corporation sponsored this work. 



Abstract. This paper describes high performance single-chip FPGA 
implementations of the new Advanced Encryption Standard (AES) algorithm, 
Rijndael. The designs are implemented on the Virtex-E FPGA family of 
devices. FPGAs have proven to be very effective in implementing encryption 
algorithms. They provide more flexibility than ASIC implementations and 
produce higher data-rates than equivalent software implementations. A novel, 
generic, parameterisable Rijndael encryptor core capable of supporting varying 
key sizes is presented. The 192-bit key and 256-bit key designs run at data rates 
of 5.8 Gbits/sec and 5.1 Gbits/sec respectively. The 128-bit key encryptor core 
has a throughput of 7 Gbits/sec which is 3.5 times faster than similar existing 
hardware designs and 21 times faster than known software implementations, 
making it the fastest single-chip FPGA Rijndael encryptor core reported to date. 
A fully pipelined single-chip 128-bit key Rijndael encryptor/decryptor core is 
also presented. This design runs at a data rate of 3.2 Gbits/sec on a Xilinx 
Virtex-E XCV3200E-8-CG1156 FPGA device. There are no known single-chip 
FPGA implementations of an encryptor/decryptor Rijndael design. 

Keywords: FPGA Implementation, AES, Rijndael, Encryption 



1 Introduction 

In September 1997 the National Institute of Standards and Technology (NIST) issued 
a request for possible candidates for a new Advanced Encryption Standard (AES) to 
replace the Data Encryption Standard (DES). In August 1998, 15 candidate 
algorithms were selected and a year later, in August 1999 five finalists were 
announced: MARS, RC6, Rijndael, Serpent and Twofish. On 02 October 2000, the 
Rijndael algorithm [1], developed by Joan Daemen and Vincent Rijmen was selected 
as the winner of the AES development race. In performance comparison studies 
carried out on all five finalists [2, 3, 4, 7], Rijndael proved to be one of the fastest and 
most efficient algorithms. It is also easily implemented on a wide range of platforms 
and is extendable to other key and block lengths. 

In this paper two fully pipelined Rijndael algorithm designs are presented. The 
designs are implemented using Xilinx Foundation Series 3.1i software on the Virtex-E 
FPGA family of devices [5]. A fully pipelined Rijndael design requires 
considerable memory; hence, its implementation is ideally suited to the Virtex-E and 
Virtex-E Extended Memory range of FPGAs, which contain devices with up to 280 
RAM Blocks (BRAMs). The first design presented is an encryption-only design 
capable of supporting 128-bit, 192-bit and 256-bit keys. The 128-bit key design is 
implemented on the XCV812E-8-BG560 device. The 192-bit and 256-bit key designs are 
implemented on XCV3200E-8-CG1156 devices, as they are too large to place on the 
XCV812E device. The authors are not aware of any other Rijndael hardware design 
capable of supporting varying key sizes. However, software designs do exist. The 
fastest known software implementations of Rijndael are by Brian Gladman [6]. On a 
933 MHz Pentium III processor, his 128-bit key design achieves a throughput of 325 
Mbits/sec, the 192-bit key design runs at 275 Mbits/sec and the 256-bit key design at 
236 Mbits/sec. Hardware implementations of a 128-bit key design do exist. An 
implementation on a Virtex XCV1000 FPGA device by Gaj and Chodowiec [4] 
achieved a data-rate of 331.5 Mbits/sec. Dandalis, Prasanna and Rolim [2] carried out 
an implementation on a Xilinx Virtex device and achieved an encryption rate of 353 
Mbits/sec. A partially unrolled design by Elbirt, Yip, Chetwynd and Paar [7] on the 
Virtex XCV1000-BG560 FPGA performed at a data-rate of 1937.9 Mbits/sec. The second 
design [8] presented is capable of both encryption and decryption operations and is 
implemented on an XCV3200E-8-CG1156 device. There are no known single-chip 
FPGA implementations of the Rijndael algorithm which perform both encryption and 
decryption. However, Ichikawa, Kasuya and Matsui’s [3] implementation on a CMOS 
ASIC achieves 1950 Mbits/sec. The encryptor/decryptor implementation by Weeks, 
Bean, Rozylowicz and Ficke [9] performs at a rate of 5163 Mbits/sec on a CMOS 
ASIC. 

Section 2 of the paper provides a description of the Rijndael Algorithm. Section 3 
outlines the design of the fully pipelined Rijndael implementations. Performance 
results are given in section 4. Finally, concluding remarks are made in section 5. 



2 Rijndael Algorithm 

The Rijndael algorithm is an iterated block cipher. The block and key lengths can be
128, 192 or 256 bits. NIST requested that the AES must implement a symmetric
block cipher with a block size of 128 bits, hence the variations of Rijndael which can
operate on larger block sizes will not be included in the actual standard. Rijndael also
has a variable number of iterations or ‘rounds’: 10, 12 and 14 when the key lengths
are 128, 192 and 256 bits respectively. The transformations in Rijndael consider the data
block as a 4-column rectangular array of 4-byte vectors, or State, as shown in Fig. 1. A
128-bit plaintext consists of 16 bytes, B0, B1, B2, B3, B4 ... B14, B15. Hence, B0
becomes P0,0, B1 becomes P1,0, B2 becomes P2,0 ... B4 becomes P0,1 and so on. The
key is also considered to be a rectangular array of 4-byte vectors, the number of
columns, Nk, of which is dependent on the key length. This is illustrated in Fig. 2. The
algorithm design consists of an initial data/key addition, nine, eleven or thirteen
rounds when the key length is 128 bits, 192 bits or 256 bits respectively, and a final
round, which is a variation of the typical round. The Rijndael key schedule expands
the key entering the cipher so that a different sub-key or round key is created for each
algorithm iteration. An outline of Rijndael is shown in Fig. 3.

Fig. 1. State Rectangular Array

Fig. 2. Key Rectangular Array




Fig. 3. Outline of 128-bit Key Rijndael Encryption Algorithm 
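
The byte-to-State mapping described above can be sketched in a few lines of Python. This is only an illustration: the function names bytes_to_state and state_to_bytes are hypothetical, and the State is held as state[r][c] so that it corresponds to P(r, c) in the text.

```python
def bytes_to_state(block: bytes) -> list[list[int]]:
    """Arrange 16 input bytes B0..B15 into a 4x4 State, column by column.

    state[r][c] holds P(r, c): B0 -> P(0,0), B1 -> P(1,0), B4 -> P(0,1).
    """
    if len(block) != 16:
        raise ValueError("a 128-bit block is exactly 16 bytes")
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state: list[list[int]]) -> bytes:
    """Inverse mapping: read the State back out column by column."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


# Example: with input bytes 0..15, byte B4 ends up in row 0 of column 1.
state = bytes_to_state(bytes(range(16)))
```

Storing the State by row index first makes the later ShiftRow step a simple rotation of each inner list.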



2.1 Rijndael Round 

The Rijndael round comprises a ByteSub Transformation, a ShiftRow
Transformation, a MixColumn Transformation and a Round Key Addition. The
ByteSub transformation is the s-box of the Rijndael algorithm and operates on each of
the State bytes independently. The s-box is constructed by finding the multiplicative
inverse of each byte in GF(2^8). An affine transformation is then applied, which
involves multiplying the result by a matrix and adding the hexadecimal number
‘63’. In the ShiftRow transformation, the rows of the State are cyclically shifted to the
left. Row 0 is not shifted, row 1 is shifted 1 place, row 2 by 2 places and row 3 by 3
places. The MixColumn transformation operates on the columns of the State. Each
column is considered a polynomial over GF(2^8) and multiplied modulo x^4 + 1 with a
fixed polynomial c(x), where

c(x) = ‘03’x^3 + ‘01’x^2 + ‘01’x + ‘02’    (1)

Finally, the State bytes and round-key bytes are XORed in Round Key Addition. A
typical Rijndael round is illustrated in Fig. 4. In the final round the MixColumn
transformation is excluded.
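
As a software illustration of the three data transformations just described, the following Python sketch builds the s-box from the multiplicative inverse and affine step, and implements ShiftRow and MixColumn on a State held as state[r][c]. It is a reference model under stated assumptions, not the hardware design of this paper: the helper names are hypothetical, the field polynomial x^8 + x^4 + x^3 + x + 1 is taken from the Rijndael specification, and the affine step is written as an XOR of four byte rotations, which is an equivalent form of the matrix multiplication.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return product


def gf_inverse(a: int) -> int:
    """Multiplicative inverse computed as a^254; 0 is mapped to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(value: int, shift: int) -> int:
    """Rotate an 8-bit value left by 'shift' bit positions."""
    return ((value << shift) | (value >> (8 - shift))) & 0xFF


def byte_sub(value: int) -> int:
    """ByteSub: inverse in GF(2^8), then the affine map with constant '63'."""
    inv = gf_inverse(value)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


SBOX = [byte_sub(v) for v in range(256)]  # the 256-entry look-up table


def shift_row(state):
    """Rotate row r of the State left by r places."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


def mix_column(column):
    """Multiply one State column by c(x) modulo x^4 + 1."""
    a0, a1, a2, a3 = column
    return [
        gf_mul(2, a0) ^ gf_mul(3, a1) ^ a2 ^ a3,
        a0 ^ gf_mul(2, a1) ^ gf_mul(3, a2) ^ a3,
        a0 ^ a1 ^ gf_mul(2, a2) ^ gf_mul(3, a3),
        gf_mul(3, a0) ^ a1 ^ a2 ^ gf_mul(2, a3),
    ]
```

Precomputing SBOX once mirrors the look-up-table approach: the field inversion is only needed to generate the table, not at encryption time.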

2.2 Key Schedule 

The Rijndael key schedule consists of two parts: Key Expansion and Round Key
Selection. Key Expansion involves expanding the cipher key into a linear array of
4-byte words, the length of which is determined by the data block length, Nb, multiplied
by the number of rounds, plus 1, i.e. Nb * (Nr + 1). The data block length, Nb = 4.
When the key block length, Nk = 4, 6 and 8, the number of rounds is 10, 12 and 14
respectively. Hence the lengths of the expanded key are as shown in Table 1.

Fig. 4. Rijndael Round



Table 1. Length of Expanded Key for Varying Key Sizes

Data Block Length, Nb       4     4     4
Key Block Length, Nk        4     6     8
Number of Rounds, Nr       10    12    14
Expanded Key Length        44    52    60



The first Nk words of the expanded key contain the cipher key. When Nk = 4 or 6,
each remaining word, W[i], is found by XORing the previous word, W[i-1], with the
word Nk positions earlier, W[i-Nk]. For words in positions which are a multiple of Nk,
a transformation is applied to W[i-1] before it is XORed. This transformation involves
a cyclic shift of the bytes in the word. Each byte is passed through the Rijndael s-box
and the resulting word is XORed with a round constant. However, when Nk = 8, an
additional transformation is applied. For words in positions of the form
(Nk*i + 4), each byte of the word, W[i-1], is passed through the Rijndael s-box. The
round keys are selected from the expanded key. In a design with Nr rounds, Nr + 1
round keys are required. For example, a 10-round design requires 11 round keys.
Round key 0 is W[0] to W[3] and is utilized in the initial data/key addition, round key
1 is W[4] to W[7] and is used in round 0, round key 2 is W[8] to W[11] and used in
round 1 and so on. Finally, round key 10 is used in the final round.
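
The key expansion and round key selection described above can be sketched as follows. This is a software model only, with hypothetical names (expand_key, round_key); the s-box is regenerated from the GF(2^8) inverse and affine step so that the block is self-contained, and the round constants are produced by repeated doubling in GF(2^8).

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return product


def build_sbox() -> list[int]:
    sbox = []
    for value in range(256):
        inv = 1
        for _ in range(254):           # value^254 is the inverse (0 stays 0)
            inv = gf_mul(inv, value)
        out = inv
        for shift in range(1, 5):      # affine map as XOR of four rotations
            out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(out ^ 0x63)
    return sbox


SBOX = build_sbox()


def expand_key(key: bytes) -> list[bytes]:
    """Expand a 16-, 24- or 32-byte cipher key into 4 * (Nr + 1) words."""
    nk = len(key) // 4
    if len(key) % 4 or nk not in (4, 6, 8):
        raise ValueError("key must be 128, 192 or 256 bits")
    nr = nk + 6
    words = [key[4 * i:4 * i + 4] for i in range(nk)]
    rc = 1                                              # RC[1] = '01'
    for i in range(nk, 4 * (nr + 1)):
        temp = words[i - 1]
        if i % nk == 0:
            temp = temp[1:] + temp[:1]                  # cyclic byte shift
            temp = bytes(SBOX[b] for b in temp)         # s-box on each byte
            temp = bytes([temp[0] ^ rc]) + temp[1:]     # round constant
            rc = gf_mul(rc, 2)
        elif nk == 8 and i % nk == 4:
            temp = bytes(SBOX[b] for b in temp)         # extra step for Nk = 8
        words.append(bytes(x ^ y for x, y in zip(words[i - nk], temp)))
    return words


def round_key(words: list[bytes], index: int) -> bytes:
    """Round key 'index' is the four words W[4*index] .. W[4*index + 3]."""
    return b"".join(words[4 * index:4 * index + 4])


key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
words = expand_key(key)
```

With the 128-bit example key above, the expansion yields 44 words, matching Table 1, and round key 0 is the cipher key itself.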



2.3 Decryption 

The decryption process in Rijndael is effectively the inverse of its encryption process.
It comprises an inverse of the final round, inverses of the rounds, followed by the
initial data/key addition. The data/key addition remains the same as it involves an
XOR operation, which is its own inverse. The inverse of the round is found by
inverting each of the transformations in the round. The inverse of ByteSub is obtained
by applying the inverse of the affine transformation and taking the multiplicative
inverse in GF(2^8) of the result. In the inverse of the ShiftRow transformation, row 0 is
not shifted, row 1 is now shifted 3 places, row 2 by 2 places and row 3 by 1 place.
The polynomial, c(x), used to transform the State columns in the inverse of
MixColumn is given by,

c(x) = ‘0B’x^3 + ‘0D’x^2 + ‘09’x + ‘0E’    (2)

Similarly to the data/key addition, Round Key Addition is its own inverse. During
decryption, the key schedule does not change, however the round keys constructed are
now used in reverse order. For example, in a 10-round design, round key 0 is still
utilized in the initial data/key addition and round key 10 in the final round. However,
round key 1 is now used in round 8, round key 2 in round 7 and so on.
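
A short check, given here as an illustrative sketch with hypothetical helper names, shows why the polynomial of Equation (2) undoes MixColumn: multiplying it by the polynomial of Equation (1) modulo x^4 + 1 gives ‘01’. Polynomials are written as coefficient lists from x^0 up to x^3.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return product


def poly_mul_mod(p, q):
    """Multiply two degree-3 polynomials over GF(2^8) modulo x^4 + 1."""
    result = [0, 0, 0, 0]
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            # x^4 = 1 modulo x^4 + 1, so exponents wrap around modulo 4
            result[(i + j) % 4] ^= gf_mul(p_i, q_j)
    return result


MIX = [0x02, 0x01, 0x01, 0x03]       # Equation (1), coefficients of x^0..x^3
INV_MIX = [0x0E, 0x09, 0x0D, 0x0B]   # Equation (2), coefficients of x^0..x^3

column = [0xDB, 0x13, 0x53, 0x45]    # arbitrary example State column
mixed = poly_mul_mod(MIX, column)
restored = poly_mul_mod(INV_MIX, mixed)
```

Because the product of the two polynomials is the constant ‘01’, applying the inverse polynomial to any mixed column returns the original column.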



3 Design of Pipelined Rijndael Implementations 

The Rijndael algorithm implementations presented in this paper are based on the
Electronic Codebook (ECB) mode. Although ECB mode is less secure than other
modes of operation, it is commonly used and its operation can be pipelined [10]. The
fully pipelined Rijndael implementation will also operate in Counter mode. Counter
mode is a simplification of Output Feedback (OFB) mode and it involves updating the
input plaintext block, P, as a counter, P_{i+1} = P_i + 1, rather than using feedback. Hence,
the ciphertext block, C_i, is not required in order to encrypt plaintext block, P_{i+1} [11].
Counter mode provides more security than ECB mode and operation in either mode
will achieve high throughputs.
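
The property that makes Counter mode suitable for pipelining, namely that no block depends on the ciphertext of an earlier block, can be illustrated with a minimal sketch. Two assumptions are made loudly here: toy_block_encrypt is a hypothetical stand-in built on SHA-256, not Rijndael and not secure, and the arrangement shown (counter encrypted, result XORed with the data) is the commonly described form of counter mode, which may differ in detail from the arrangement used by the authors.

```python
import hashlib


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a 128-bit block cipher (NOT Rijndael, NOT secure)."""
    return hashlib.sha256(key + block).digest()[:16]


def keystream_block(key: bytes, start_counter: int, index: int) -> bytes:
    """Encrypt the counter value start_counter + index (no feedback)."""
    counter = (start_counter + index) % (1 << 128)
    return toy_block_encrypt(key, counter.to_bytes(16, "big"))


def ctr_xor(key: bytes, start_counter: int, data: bytes) -> bytes:
    """Encrypt or decrypt by XORing data with the counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 16):
        stream = keystream_block(key, start_counter, offset // 16)
        chunk = data[offset:offset + 16]
        out.extend(a ^ b for a, b in zip(chunk, stream))
    return bytes(out)


key = b"k" * 16
message = bytes(range(48))                     # three 128-bit blocks
ciphertext = ctr_xor(key, 7, message)
third_block_alone = ctr_xor(key, 9, message[32:])
```

The third block can be processed on its own from counter value 9, without the first two ciphertext blocks, which is why consecutive blocks can occupy consecutive pipeline stages.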

A number of different architectures can be considered when designing encryption 
algorithms [7]. These are described as follows. Iterative Looping (IL) is where only
one round is designed, hence for an n-round algorithm, n iterations of that round are 
carried out to perform an encryption. Loop Unrolling (LU) involves the unrolling of 
multiple rounds. Pipelining (P) is achieved by replicating the round and placing 
registers between each round to control the flow of data. A pipelined architecture 
generally provides the highest throughput. Sub-Pipelining (SP) is carried out on a 
partially pipelined design when the round is complex. It decreases the pipeline’s delay 
between stages but increases the number of clock cycles required to perform an 
encryption. 
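
The effect of placing registers between replicated rounds can be seen in a small behavioural model. This is a sketch with hypothetical names, in which each "round" is an arbitrary Python function and one loop iteration stands for one clock cycle; it models data flow only, not timing or area.

```python
def run_pipeline(blocks, round_functions):
    """Clock blocks through a register pipeline, one round per stage.

    Returns a list of (cycle, output) pairs in the order outputs appear.
    """
    registers = [None] * len(round_functions)   # register after each round
    feed = list(blocks)
    results = []
    cycle = 0
    while feed or any(value is not None for value in registers):
        cycle += 1
        incoming = feed.pop(0) if feed else None
        previous = [incoming] + registers[:-1]
        # On each clock edge every stage processes its predecessor's output.
        registers = [
            func(value) if value is not None else None
            for func, value in zip(round_functions, previous)
        ]
        if registers[-1] is not None:
            results.append((cycle, registers[-1]))
    return results


# Ten dummy rounds that each add 1, standing in for ten cipher rounds.
ten_rounds = [lambda value: value + 1] * 10
results = run_pipeline([0, 100, 200], ten_rounds)
```

In this model the first result appears after 10 cycles and the next two on the following cycles, so three blocks take 12 cycles, whereas an iterative looping design with a single round would need 30.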

The Rijndael designs described in this paper are coded using VHDL and are fully 
pipelined: the encryption design having ten, twelve or fourteen pipeline stages and the 
encryption/decryption design having ten pipeline stages. 



3.1 Design of Generic Rijndael Encryptor Core 

The main consideration in both designs is the memory requirement. The Rijndael 
s-box in the ByteSub transformation can be implemented as a look-up table (LUT) or 
ROM. This proves a faster and more cost-effective method than implementing the 
multiplicative inverse operation and affine transformation. Since the State bytes are 
operated on individually, each Rijndael round requires sixteen 8-bit to 8-bit LUTs. In 
the key schedule, LUTs can also be used, as words are passed through the s-box. The
Virtex-E and Virtex-E Extended Memory range of FPGAs are utilized for
implementation as they contain devices with up to 280 Block SelectRAM (BRAM) 
memories. A single BRAM can be configured into two single port 256 x 8-bit RAMs, 
hence, eight BRAMs are used in each round. When the write enable of the RAM is 
low (‘0’), transitions on the write clock are ignored and data stored in the RAM is not 
affected. Hence, if the RAM is initialized and both the input data and write enable 
pins are held low then the RAM can be utilized as a ROM or LUT. 

The ShiftRow transformation is simply hardwired as no logic is involved. The
MixColumn transformation can be written as a matrix multiplication as given in
Equation 3, with a 4-byte input, a0, a1, a2, a3 and output, b0, b1, b2, b3.

[b0]   [02 03 01 01] [a0]
[b1] = [01 02 03 01] [a1]      (3)
[b2]   [01 01 02 03] [a2]
[b3]   [03 01 01 02] [a3]

The transformation is implemented by XORing the results of the multiplications in
GF(2^8) in accordance with Equation (3), as illustrated in Fig. 5.

The flowchart in Fig. 6 outlines the various stages involved in the Rijndael key
schedule for key lengths of 128, 192 and 256 bits. Nk, Nb and Nr represent
the key block length, the data block length and the number of rounds respectively.
The input to the key schedule is the cipher key and key block length and the outputs
are the Round keys. The Round keys are created as required, hence, Round key [0] is
available immediately, Round key [1] is created one clock cycle later and so on. The
various functions utilized in the key schedule are as follows:



Rem Function     : Returns the remainder value in a division

SubByte Function : Operates on a 4-byte word and each byte is passed through
                   the Rijndael s-box

RotByte Function : Involves a cyclic shift to the left of the bytes in a 4-byte
                   word. For example, an input of X0, X1, X2, X3 will produce the
                   output X1, X2, X3, X0.

Rcon Function    : Returns a 4-byte vector, Rcon[i] = (RC[i], ‘00’, ‘00’, ‘00’),
                   where the values of RC[i] are outlined in Table 2.
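
Two of these functions are simple enough to sketch directly. The names rot_byte and rcon are hypothetical, and the round constants are generated here by repeated doubling in GF(2^8), which is one way of obtaining the RC[i] values of Table 2; a hardware design could equally store them as constants.

```python
def xtime(value: int) -> int:
    """Multiply a byte by '02' in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    value <<= 1
    if value & 0x100:
        value ^= 0x11B
    return value


def rot_byte(word: bytes) -> bytes:
    """Cyclic left shift of a 4-byte word: X0,X1,X2,X3 -> X1,X2,X3,X0."""
    return word[1:] + word[:1]


def rcon(i: int) -> bytes:
    """Return Rcon[i] = (RC[i], '00', '00', '00') for i >= 1."""
    rc = 0x01
    for _ in range(i - 1):
        rc = xtime(rc)
    return bytes([rc, 0x00, 0x00, 0x00])


round_constants = [rcon(i)[0] for i in range(1, 11)]
```

The doubling wraps around after ‘80’, which is why RC[9] is ‘1B’ rather than a simple power of two.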



Table 2. Rijndael Key Schedule Round Constants

RC[1] = ‘01’    RC[2] = ‘02’    RC[3] = ‘04’    RC[4] = ‘08’    RC[5] = ‘10’
RC[6] = ‘20’    RC[7] = ‘40’    RC[8] = ‘80’    RC[9] = ‘1B’    RC[10] = ‘36’

When utilizing a 128-bit key, 40 words are created during expansion of the key and
every fourth word is passed through the s-box with each byte in the word being
transformed. Forty 8-bit to 8-bit LUTs and hence 20 BRAMs are required in its
implementation. Similarly, for a 192-bit key 16 BRAMs are required. With a 256-bit
key, 26 BRAMs are needed: 14 are utilized for words in positions which are a
multiple of 8 and a further 12 are used for words in positions of the form
(8*i + 4) lying between 8 and 60.

Thus, in the overall 128-bit key design, a total of 100 ROMs are required: 80
ROMs are required for the 10 rounds and a further 20 for the key schedule. Similarly,
112 ROMs are required for the 192-bit design (96 for the 12 rounds and 16 for the
key schedule) and 138 for the 256-bit design (112 for the 14 rounds and 26 for the key
schedule).
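
The memory totals above follow from simple counting, which the sketch below reproduces. The function names are hypothetical; the only inputs are facts stated in the text: sixteen 8-bit LUTs per round, two LUTs per BRAM, four LUTs per key schedule word that passes through the s-box, and Nr = Nk + 6 rounds.

```python
def sbox_words_in_key_schedule(nk: int) -> int:
    """Count expanded-key words that pass through the s-box."""
    nr = nk + 6
    total_words = 4 * (nr + 1)
    count = 0
    for i in range(nk, total_words):
        if i % nk == 0 or (nk == 8 and i % nk == 4):
            count += 1
    return count


def bram_budget(nk: int):
    """Return (round BRAMs, key schedule BRAMs, total) for key length Nk."""
    nr = nk + 6
    round_brams = nr * 16 // 2                   # 16 LUTs per round, 2 per BRAM
    key_brams = sbox_words_in_key_schedule(nk) * 4 // 2
    return round_brams, key_brams, round_brams + key_brams


for nk in (4, 6, 8):
    rounds, schedule, total = bram_budget(nk)
    print(f"{32 * nk}-bit key: {rounds} + {schedule} = {total} BRAMs")
```

For the 256-bit key the count splits into 7 words at multiples of 8 and 6 words at positions of the form 8*i + 4, giving the 14 and 12 BRAMs quoted in the text.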



3.2 Design of 128-bit Key Rijndael Encryptor/Decryptor Core 

In the decryption operation, the inverse of the ByteSub transformation can also be
implemented as a LUT. However, the values in this LUT are different to those
required for encryption. Therefore, it is necessary to accommodate both
encryption and decryption. One method would involve doubling the number of
BRAMs utilized, however, this would prove costly on area. In the Rijndael
encryption/decryption design, this was overcome by the addition of two BRAMs,
which were utilized as ROMs, one containing the initialization values for the LUTs
required during encryption, the other containing the values for the LUTs required
during decryption. Therefore, instead of initializing each individual BRAM as a
ROM, when the design is set to encrypt, all the BRAMs are initialized with data read
from the ROM containing the values required for encryption. When the design is set
to decrypt, the BRAMs are initialized with data from the ROM containing the values
required for the decryption operation. This initialization procedure is outlined in
Fig. 7.

Fig. 6. Rijndael Key Schedule

Fig. 7. Initialization of Block RAMs in Rijndael Design



The Inverse ShiftRow transformation is also hardwired. Multiplexors (MUXs) 
select between the ShiftRow and Inverse ShiftRow wiring. Similarly to Fig. 5, the 
Inverse MixColumn transformation can be implemented by XORing results of the 
multiplications in GF(2^8) and again, MUXs are used to select between the values
required for encryption and those required during decryption. 

Since the encryptor/decryptor core design assumes a key length of 128 bits, the 
design of the key schedule is a simplification of that shown in the flowchart illustrated 
in Fig. 6. During decryption, the values of the LUTs utilized in the key schedule do 
not change, hence, the LUTs can simply be implemented as ROMs. However, the 
round keys are used in reverse order. The initialization process for either encryption 
or decryption takes 256 clock cycles as the 256 values contained in each ROM are 
read. Since the system clock for the encryption/decryption design is 25.3 MHz, this 
corresponds to an initialization time of only 10 µs. When encrypting data, the keys are
produced as each round requires them, therefore, the encryption will take 10 clock 
cycles corresponding to the 10 rounds when using a 128-bit key. The design assumes 
that the same key is utilized during a session of data transfer. If decrypting data, the 
initialization process will be as described above. However, initial decryption will take 
20 clock cycles, 10 clock cycles for the required round keys to be constructed and a 
further 10 corresponding to the 10 rounds. The overall 128-bit key 
encryptor/decryptor design, therefore, requires 102 BRAMs.



4 Performance Results

The Rijndael designs are implemented using Xilinx Foundation Series 3.1i software
and Synplify Pro V6.0 on Xilinx Virtex-E FPGA devices. Data blocks can be
accepted every clock cycle and after an initial delay the respective
encrypted/decrypted data blocks appear on consecutive clock cycles. The first design
implemented is the generic encryptor core, which supports 128-bit, 192-bit and 256-bit
keys. The performance results obtained for this implementation will be similar to
those of a design with only decryption capabilities. The main difference in the two
implementations would be the initial delay time as mentioned in section 3. The 128-bit
key encryption design implemented on the Virtex-E XCV812E-8-BG560 device
utilizes 2222 CLB slices (23%) and 100 BRAMs (35%). Of the 404 IOBs, 384 are
used. The design uses a system clock of 54.35 MHz and runs at a data-rate of 7
Gbits/sec (870 Mbytes/sec). This result proves faster than similar existing FPGA
implementations, as illustrated in Table 3 below. The implementations included in
the table are as outlined in section 1. The design is also the most efficient in terms of
CLB utilization although it must be remembered that the previous implementations
were limited in their use of device.

Table 3. Specifications of 128-bit Key Rijndael Encryption FPGA Implementations

Both the 192-bit and 256-bit key encryption designs are implemented on Virtex-E
XCV3200E-8-CG1156 devices, as they require a higher number of IOBs than that
available on the XCV812E device. The 192-bit key encryption design utilizes 2577
CLB slices (7%) and 112 BRAMs (53%). Of the 804 IOBs, 448 are used. The design
uses a system clock of 45.44 MHz and runs at a data-rate of 5.8 Gbits/sec (727
Mbytes/sec). The 256-bit key encryption design utilizes 2995 CLB slices (9%) and 138
BRAMs (66%). Of the 804 IOBs, 512 are used. The design uses a system clock of
39.88 MHz and runs at a data-rate of 5 Gbits/sec (638 Mbytes/sec). There have been
no FPGA implementations of Rijndael designs capable of supporting 128-bit, 192-bit
and 256-bit keys published to date.

The second design implemented is the Rijndael encryptor/decryptor design. On the
Virtex-E XCV3200E-8-CG1156 device, this design utilizes 7576 CLB slices (23%) and
102 BRAMs (49%). Of the 804 IOBs, 385 are used. The design uses a system clock of
25.3 MHz and runs at a data-rate of 3239 Mbits/sec (405 Mbytes/sec). There are no
known similar single-chip FPGA encryptor/decryptor implementations. Also, the
results obtained compare very well with existing ASIC implementations, as illustrated
in Table 4 below.



Table 4. Specifications of Rijndael ASIC Implementations

                                      Device      Throughput (Mbits/sec)
Ichikawa, Kasuya, Matsui [3]          CMOS        1950
Weeks, Bean, Rozylowicz, Ficke [8]    CMOS        5163
McLoone, McCanny                      XCV3200E    3239



It is possible to enhance the performance figures of the two designs presented by 
further optimization of the algorithm specific to the requirements of the FPGA device 
on which the design is to be implemented. However, this would result in the design 
being less easy to migrate to other devices and technologies. 
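
The data-rates quoted in this section are consistent with a fully pipelined design that accepts one 128-bit block per clock cycle, so that the rate is simply 128 times the clock frequency. The sketch below redoes that arithmetic from the clock figures given in the text; the dictionary labels are descriptive only. A clock of 25.3 MHz gives roughly 3238 Mbits/sec, so the quoted 3239 Mbits/sec presumably reflects an unrounded clock value.

```python
BLOCK_BITS = 128  # one data block is accepted on every clock cycle


def data_rate_mbits(clock_mhz: float) -> float:
    """Data-rate in Mbits/sec of a fully pipelined 128-bit design."""
    return clock_mhz * BLOCK_BITS


clocks_mhz = {
    "128-bit key encryptor": 54.35,
    "192-bit key encryptor": 45.44,
    "256-bit key encryptor": 39.88,
    "128-bit key encryptor/decryptor": 25.3,
}

for name, clock in clocks_mhz.items():
    rate = data_rate_mbits(clock)
    print(f"{name}: {rate:.1f} Mbits/sec ({rate / 8:.1f} Mbytes/sec)")

# Initialization of the encryptor/decryptor reads 256 ROM values,
# one per clock cycle, at 25.3 MHz.
init_time_us = 256 / 25.3
print(f"initialization time: {init_time_us:.2f} microseconds")
```

The same relation explains why the wider-key designs are slower: they run at lower clock frequencies, while the number of bits accepted per cycle stays at 128.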



5 Conclusions 

To conclude, this paper describes high performance single-chip FPGA 
implementations of the Rijndael algorithm. The generic, parameterisable encryption 
design is the only hardware Rijndael encryption design that supports varying key 
sizes, reported to date. When implemented, the 128-bit key encryption design 
performs at a data-rate of 7 Gbits/sec, which is 3.5 times faster than similar existing 
FPGA implementations and 21 times faster than software implementations. Previous 
Rijndael encryption-only designs are implemented on Virtex XCV1000 devices,
which consist of only 32 BRAMs and therefore, cannot support a fully pipelined 
Rijndael design. The Virtex-E and Virtex-E Extended Memory family of FPGAs, 
however, contains up to 280 BRAMs and can easily accommodate large unrolled 
designs. The encryptor/decryptor core runs at 3.2 Gbits/sec. This implementation not 
only compares favorably with similar ASIC designs but is also the only known single- 
chip FPGA Rijndael design capable of both encryption and decryption. Future work 
will include parameterising the Rijndael encryption/decryption design so that it may
also accept varying key sizes. Rijndael is set to be approved by NIST and replace 
DES as the Federal Information Processing Encryption Standard (FIPS) in the 
summer of 2001. It will replace DES in applications such as IPSec protocols, the 
Secure Socket Layer (SSL) protocol and in ATM cell encryption. In general, 
hardware implementations of encryption algorithms and their associated key 
schedules are physically secure, as they cannot easily be modified by an outside
attacker. Also, the high-speed Rijndael encryptor core and Rijndael
encryptor/decryptor core presented should prove beneficial in applications where
speed is vital, as with real-time communications such as satellite communications and
electronic financial transactions.



Abstract. This paper presents an evaluation of the Rijndael cipher,
the Advanced Encryption Standard winner, from the viewpoint of its
implementation in Field Programmable Devices (FPD). Starting with
an analysis of the algorithm’s general characteristics, a general cipher structure
is described. Two different methods of mapping the Rijndael algorithm
to FPD are analyzed and the suitability of available FPD families is evaluated.
Finally, results of the proposed mapping implemented in Altera FLEX,
ACEX and APEX FPD are presented and compared with the fastest
known Xilinx FPGA implementation. The results obtained are significantly
faster than those of other implementations known up to now.



1 Introduction 

Since 1997 the National Institute of Standards and Technology (NIST) has
been working with the international cryptographic community to develop an
Advanced Encryption Standard (AES). One of the requirements given by NIST
for AES candidates [1] was the possibility of their efficient hardware implementation
[2]. Compared with a software-based solution, hardware implementation offers
superior performance and significantly higher system security. Implementation in
Field Programmable Devices (FPD)^1 adds to these two parameters the possibility
of modifying the algorithm in the field. Several papers dealing with implementation
of AES candidates in reconfigurable hardware have been published so far. Some
of them give only estimates of these parameters [3], while others present results
based on implementation in FPGA [4], [5], [6]. Although some authors (e.g. [4],
[5]) analyze the possibility of increasing the speed using pipeline structures, the
use of these structures in current cryptography is limited, because they are not
suitable for encryption/decryption in the most common feedback modes such
as Cipher Block Chaining mode, Cipher Feedback mode, and Output Feedback
mode. All of the above-mentioned papers present results of implementation in Xilinx
FPGA (mostly in the high-performance Virtex family [7]) and the final AES
NIST report [2] is based on these results. Some research groups [8], [9], [10], [11]
have also presented results of the implementation of AES candidates in Altera FLEX
logic devices [12]. On October 2, 2000 NIST decided to propose the Rijndael cipher
[13] as the Advanced Encryption Standard and it is expected that Rijndael
will be used by the U.S. Government and, on a voluntary basis, by the private sector.
Based on this decision our further optimization effort was concentrated on the
Rijndael algorithm, and the performance results presented in [9], [14] have been
significantly improved in our new implementations. This improvement is based on
a different method of algorithm mapping, better VHDL encoding and the use of the
Altera low-cost ACEX^2 [15] and high-performance APEX [16] FPD families. This
paper evaluates two different methods of Rijndael cipher implementation
from the viewpoint of its hardware mapping into high-performance Altera FPD
and it is organized as follows. A brief overview of the Rijndael cipher algorithm and
its basic building blocks is given in Section 2. In Section 3 aspects of the proposed
methods are discussed and different solutions from the viewpoint of the FPD
embedded memory occupation and speed are presented. The limitations and
implementation results of the VHDL design for advanced Altera FPD are described in
Section 4. In Section 5, the results obtained for both methods are compared and
some comparisons with the fastest known implementations are made. Finally, in
Section 6, possible future work is described and concluding remarks are made.

^1 There are several vendors of FPD. These vendors use different names for their
FPD, e.g. Field Programmable Gate Arrays (FPGA) by Xilinx and Complex Programmable
Logic Devices (CPLD) by Altera. The abbreviation FPD is used as a common
name for all of them.

2 Rijndael Cipher Overview 

2.1 Basic Algorithm Characteristics 

Rijndael is a block cipher using 128, 192 and 256-bit input/output blocks and
keys [13]. The sizes of data blocks and keys can be chosen independently. The number
of rounds depends on both of these parameters and is given in [13]. In the following
analysis 128 bits for both the I/O block and the user key are assumed. Therefore, the
cipher in all presented configurations operates in Nr = 10 rounds.

Encryption and Decryption Algorithms. Encryption and decryption (in
the following text referred to as standard decryption) algorithms for a 128-bit
input/output block and 128-bit user key are depicted in Fig. 1a and Fig. 1b,
respectively. Round keys K0 to K10 are obtained by the expansion of the user
key, following the algorithm described below.

As can be seen, the cipher core is composed mostly of operations that are
easy to implement in reconfigurable hardware: byte rotation (permutation),
byte substitution and bit-wise addition modulo 2 (XOR). The only exception is
the MixColumn function and its inverse (InvMixColumn function), which involve
matrix multiplication on 32-bit blocks in the Galois field GF(2^8). The byte substitution
operation uses 8 x 8-bit S-boxes (byte substitution boxes). There is one type of
S-box for encryption and another one for decryption. Both of them are applied
byte-wise on the whole 128-bit block.

^2 Low-cost FPD are optimal for many practical cost-sensitive cryptographic applications.

Fig. 1. Structure of Rijndael cipher a) encryption algorithm b) standard decryption
algorithm

Key Scheduling. Round keys Ki are derived from the user key by means of 
the key schedule. It consists of two components: key expansion and round key 
selection. Total number of round keys is equal to Nr + 1. The key expansion 
algorithm (see Fig. 2) uses bit-wise additions modulo 2 of 32-bit values obtained 
from user key combined with byte substitution, the byte rotation and round 
constants (RCons) addition. The order of round key calculation is the same for 
both encryption and decryption, although decryption uses round keys in reverse 
order. 

Difference between Encryption and Decryption. Standard encryption and 
decryption algorithms use different ordering of basic operations. Moreover, basic 
operations for decryption are different (they implement inverse versions of basic 
encryption operations). As a consequence of this fact resource sharing between 
encryption and decryption logic is very limited. 

Modification of the Order of Operations. In the table-lookup implementation
it is essential that the only nonlinear step (inverse byte substitution) is the
first transformation in a round and that the rows are shifted before MixColumn
is applied. In the standard decryption algorithm inverse byte substitution is the
last operation in the round. It is shown in [13] that the order of inverse byte substitution
and inverse byte rotation can be changed (both operations are byte oriented).

Fig. 2. Key expansion algorithm




Fig. 3. Round modification for decryption 



This feature is depicted in Fig. 3. Since InvMixCol is a linear transformation,
the following equation is valid

InvMixColumn(d ⊕ K) = InvMixColumn(d) ⊕ InvMixColumn(K) . (1)

Using the properties described above, the standard Rijndael decryption algorithm
can be transformed into its modified form described in Fig. 4. Comparing Fig. 3
and Fig. 1a it can be found that the order of operations of the encryption and modified
decryption algorithms is the same (although the significance of each operation is
different). Moreover, round keys have to be inverted by the InvMixColumn function
with the exception of the first and last round keys.
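
The linearity property of Equation (1) can be checked numerically. The sketch below is an illustration with hypothetical names: it writes InvMixColumn as multiplication by the standard inverse matrix from the Rijndael specification and tests the identity on randomly chosen columns d and round-key columns K.

```python
import random


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return product


INV_MIX_MATRIX = [
    [0x0E, 0x0B, 0x0D, 0x09],
    [0x09, 0x0E, 0x0B, 0x0D],
    [0x0D, 0x09, 0x0E, 0x0B],
    [0x0B, 0x0D, 0x09, 0x0E],
]


def inv_mix_column(column):
    """Apply InvMixColumn to one 4-byte column."""
    result = []
    for row in INV_MIX_MATRIX:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)
        result.append(acc)
    return result


def xor_columns(a, b):
    return [x ^ y for x, y in zip(a, b)]


def linearity_holds(trials: int = 1000, seed: int = 2001) -> bool:
    """Check InvMixColumn(d xor K) == InvMixColumn(d) xor InvMixColumn(K)."""
    rng = random.Random(seed)
    for _ in range(trials):
        d = [rng.randrange(256) for _ in range(4)]
        k = [rng.randrange(256) for _ in range(4)]
        left = inv_mix_column(xor_columns(d, k))
        right = xor_columns(inv_mix_column(d), inv_mix_column(k))
        if left != right:
            return False
    return True
```

This is the property that allows the round key addition to be moved past InvMixColumn, at the cost of applying InvMixColumn to the round keys beforehand.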



2.2 Classification of Basic Cipher Operations 
and Choice of Technology 

Rijndael has a relatively simple structure, and most of its operations can be easily
implemented in FPD. Efficient implementation of the Rijndael algorithm requires
the ability to implement the following basic cipher operations:

Fig. 4. Modified decryption algorithm



Bit-wise Addition Modulo-2 (XOR). This operation is easily realisable in 
FPD using input lookup table of Logic Element (LE) of Altera FPD or Con- 
figurable Logic Block (CLB) of Xilinx FPD. XOR operation with two to four 
inputs can be implemented in each LE or CLB slice (1/2 of CLB). 

Fixed Rotation (Byte Permutation). Also this operation can be easily 
implemented but in this case routing resources are used. Cell interconnections 
can be reordered in a very simple way to implement rotations in both directions. 
Byte permutation order is different for encryption and for decryption. 

8 x 8-bit S-boxes. The Rijndael cipher uses 2 types of fixed 8 x 8-bit S-boxes: S-box
S[x] for encryption and inverse S-box for decryption. For memory-limited
implementations both S-boxes can be efficiently computed using the algorithm
described in [13].

In general, S-boxes can be implemented as lookup tables using dedicated
embedded memories or within a set of small memories of LEs or CLBs configured
as memory elements. The actual design choice depends on the features of the FPD family.
8 x 8-bit S-boxes should preferably be realised using large embedded memories,
because a combinatorial function would occupy many resources (input LUTs of
LE or CLB). Dedicated embedded memory blocks are ideal for implementing
S-boxes and they were used in implementations based on Altera devices.

The size of the required memory depends on the number of bytes that should be
substituted in one clock period. If the whole 128-bit word should be processed
in one period, 16 identical 8 x 8-bit S-boxes have to be used for encryption
and for decryption. This requires a total memory capacity of 65536 bits^3. Since all
operations of the round can be executed in parallel during one clock period, the
algorithm can be executed in at least Nr + 1 clock periods.



MixColumn and InvMixColumn Functions. There is quite an important
difference between encryption and decryption. The MixColumn function

[Y0]   [02 03 01 01]   [X0]
[Y1] = [01 02 03 01] • [X1]      (2)
[Y2]   [01 01 02 03]   [X2]
[Y3]   [03 01 01 02]   [X3]

where • represents the multiplication in GF(2^8) using the primitive polynomial
m(x) = x^8 + x^4 + x^3 + x + 1 and Xi, Yi ∈ GF(2^8), is replaced by its inverse, the
InvMixColumn function

[Y0]   [0E 0B 0D 09]   [X0]
[Y1] = [09 0E 0B 0D] • [X1]      (3)
[Y2]   [0D 09 0E 0B]   [X2]
[Y3]   [0B 0D 09 0E]   [X3]

Matrix Multiplications in GF(2^8). These operations constitute the main
obstacle to efficient implementation of this cipher in programmable devices. The
authors of Rijndael propose a method using the so-called XTime function [13] to solve
this problem. This 8-bit function can also be easily implemented in FPD and
the matrix multiplication then consists of XOR operations applied to the outputs of
this function. There is also another possibility to realise matrix multiplication.
Since the square matrices in the MixColumn (2) and InvMixColumn (3) functions
contain constant elements (polynomials in GF(2^8)), it can be shown [17] that this
multiplication can be replaced by several XOR (⊕) operations that are simple
to implement in FPD. For example, the operation

Y = 03 • X, for X, Y ∈ GF(2^8)    (4)

represents multiplication in GF(2^8) using the primitive polynomial m(x). It can be
implemented using the following bit-wise XOR operations

y7 = x7 ⊕ x6            y6 = x6 ⊕ x5
y5 = x5 ⊕ x4            y4 = x7 ⊕ x4 ⊕ x3
y3 = x7 ⊕ x3 ⊕ x2       y2 = x2 ⊕ x1            (5)
y1 = x7 ⊕ x1 ⊕ x0       y0 = x7 ⊕ x0

In this way matrix multiplication can be replaced by several XOR operations.
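
The bit-level equations for multiplication by 03 can be verified exhaustively against a reference computed with the XTime function. The sketch below uses hypothetical function names and takes m(x) = x^8 + x^4 + x^3 + x + 1 from the Rijndael specification; bit i of a byte is written x[i].

```python
def xtime(value: int) -> int:
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    value <<= 1
    if value & 0x100:
        value ^= 0x11B
    return value


def times3_reference(value: int) -> int:
    """03 * X computed as (02 * X) xor X."""
    return xtime(value) ^ value


def times3_xor_network(value: int) -> int:
    """03 * X computed bit by bit with XOR gates only."""
    x = [(value >> i) & 1 for i in range(8)]
    y = [
        x[7] ^ x[0],          # y0
        x[7] ^ x[1] ^ x[0],   # y1
        x[2] ^ x[1],          # y2
        x[7] ^ x[3] ^ x[2],   # y3
        x[7] ^ x[4] ^ x[3],   # y4
        x[5] ^ x[4],          # y5
        x[6] ^ x[5],          # y6
        x[7] ^ x[6],          # y7
    ]
    return sum(bit << i for i, bit in enumerate(y))


mismatches = [v for v in range(256)
              if times3_xor_network(v) != times3_reference(v)]
```

No output bit needs more than three inputs, so each bit fits in a single four-input look-up table of a logic element.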

(Footnote) In some FPD families that include large dedicated embedded memory
blocks (e.g. 4 kbit blocks in Altera FLEX 10KE and ACEX 1K) it makes no sense
to use the compact S-box and inverse S-box representation based on the affine
transform [13].

Key Scheduling. The key scheduling differs for encryption and decryption.
Encryption round keys are used in normal order and can be computed
on-the-fly. During standard decryption, the encryption round keys are used in
reverse order, so they cannot be computed on-the-fly. For the modified
decryption depicted in Fig. 4, an additional InvMixColumn function has to be
applied to the encryption round keys [13]. Round keys can be calculated easily
from the user key using operations such as XOR and rotation on 32-bit data, so
the key schedule computation is very fast. Since round key preparation for the
modified decryption algorithm is more complex (application of the InvMixColumn
function to the encryption round keys), decryption latency (cipher preparation
time) could be higher than that of encryption. Encryption and decryption use
(N_r + 1) 128-bit round keys, so for N_r = 10 rounds the RAM capacity should be
at least 11 × 128 = 1408 bits.
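The storage figure follows directly from the number of rounds. As a small worked example (a sketch; the helper name is an assumption, and the round counts for 192-bit and 256-bit keys are the standard Rijndael values for a 128-bit block rather than figures from this paper):

```python
# Number of rounds N_r for a 128-bit block, indexed by key length in bits.
ROUNDS = {128: 10, 192: 12, 256: 14}


def round_key_ram_bits(key_bits, block_bits=128):
    """RAM needed to hold all (N_r + 1) precomputed round keys."""
    return (ROUNDS[key_bits] + 1) * block_bits
```

For 128-bit keys this gives the 1408 bits quoted above; longer keys only add two or four more 128-bit round keys.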

Choice of FPD Technology. From the above analysis it follows that the critical
operations from the point of view of implementation in FPD are byte
substitution and matrix multiplication. While fast byte substitution requires
large and fast memory blocks, matrix multiplication needs high fan-in
combinatorial logic and a significant number of global interconnections. The
Altera FLEX, ACEX and APEX families seem to fulfil the first condition better.
In contrast, the Xilinx VIRTEX family offers more interconnection flexibility
and a more convenient combinatorial part of the CLB (two LUTs per CLB). Since
the speed of the byte substitution operation seemed to dominate the overall
cipher speed, we have selected Altera FPD to implement the algorithm.

3 Methods of Rijndael Mapping to FPD 

The speed and FPD resource requirements of a Rijndael cipher mapping depend
on the method used for the actual mapping into the available FPD resources.
This section analyzes two mapping methods optimized for FPD with large embedded
memory blocks (EMB), e.g. the Embedded Array Block (EAB) in FLEX and ACEX
devices or the Embedded System Block (ESB) in APEX devices. Two types of
cipher core configurations in feedback mode, based on a basic iterative
architecture without loop unrolling, are assumed: a fast configuration and an
economic configuration. For both configurations it is assumed that the
encryption and/or decryption round keys are precomputed and stored in the EMBs.

The cipher architecture in the fast configuration is shown in Fig. 5. One round
of the cipher is implemented as a mixture of combinational logic and access to
EMBs, supplemented with a single register and a multiplexer. In the first clock
cycle, the complete input block of data (128 bits) is fed through the
multiplexer and stored in the register. In each subsequent clock cycle, one
round of the cipher is evaluated, and the result is fed back to the circuit
through the multiplexer and stored in the register. Therefore an encryption or
a decryption takes 11 clock cycles (one load cycle plus ten rounds).

Fig. 5. Cipher architecture in the fast configuration

The cipher architecture in the economic configuration is very similar to that
in the fast configuration. The only difference is that it uses a cipher core
with resource (especially EMB) sharing. The internal data block, the 128-bit
cipher state, is processed in 64 (32) bit sub-blocks in 2 (4) subsequent clock
cycles. One round of the cipher is executed in 2 (4) clock cycles and a
complete encryption/decryption can be made in 22 (44) clock cycles. The
economic configuration needs 2 (4) times fewer S-boxes than the fast
configuration.
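The cycle counts of the two configurations, and the throughput they imply for a given clock, can be summarized in a few lines. This is only an illustrative sketch: the function names are assumptions, the cycle model (ten rounds plus one load step, each repeated once per sub-block) is inferred from the figures 11, 22 and 44 quoted above, and the clock frequency is a free parameter, not a measured value.

```python
def cycles_per_block(datapath_bits, rounds=10, block_bits=128):
    """Clock cycles per 128-bit block for a 128-, 64- or 32-bit datapath."""
    passes = block_bits // datapath_bits  # sub-blocks processed per round
    return (rounds + 1) * passes


def throughput_mbits(clock_mhz, datapath_bits, block_bits=128):
    """Feedback-mode throughput in Mbits/s for a hypothetical clock frequency."""
    return block_bits * clock_mhz / cycles_per_block(datapath_bits)
```

For example, a hypothetical 55 MHz clock in the fast configuration would correspond to 128 × 55 / 11 = 640 Mbits/s.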

3.1 Mapping Based on 8 × 8-Bit S-boxes

This approach was used in all known FPD implementations of the Rijndael
algorithm since it has minimal memory requirements. For the fast configuration
(see Fig. 5) it uses 16 identical 8 × 8-bit S-boxes. Algorithm mapping for
encryption is based on the block diagram in Fig. 1a and for decryption on that
in Fig. 1b. It is clear that the logic for encryption and decryption is
different and cannot be shared. Encryption and decryption S-boxes are also
different. Since an EAB in the FLEX 10KE and ACEX 1K families contains 4096
RAM/ROM bits, two S-boxes (2 × 256 × 8 bits), one for encryption and one for
decryption, occupy exactly one EAB. Derivation of the cipher structure in the
economic configuration is straightforward and requires only some additional
multiplexers and counters.

We shall now discuss some aspects of implementing the MixColumn and
InvMixColumn transformations. The complexity of these transformations is very
different from the point of view of their implementation in FPD. Each of the 32
output bits of the MixColumn block is a function of 5 or 7 input bits. In
contrast, InvMixColumn's output bits depend on 11 to 19 (!) input bits. Since
the LE is optimized for implementation of 4-input logic functions, these large
combinatorial functions have to be implemented in several levels, e.g. 5 or
7-input functions in two levels and 17 or 19-input functions in 3 levels. This
multilevel logic significantly slows down the final cipher speed, especially in
the decryption logic. We have studied the possibilities of adapting the
function implementation to the structure of the available logic cells. While
the APEX family offers the possibility of using high fan-in product term logic,
this possibility could not be exploited, because product term logic is not
suitable for XOR function mapping. Therefore we have tried to take advantage of
another feature of Altera FPD families: the fast carry chain interconnections
of neighboring logic cells. Although these interconnections are designed to
implement fast arithmetic functions, they can also be used to implement wide
logic functions. The advantage of this method is that signal transitions via
the carry chain are several times (up to four times) faster than transitions
through a complete logic cell. The disadvantage of the method is that only
neighboring cells can be interconnected. Unfortunately, matrix multiplication
in the MixColumn, and especially in the InvMixColumn, transformation represents
a huge logic function with many interconnections. For this reason, the use of
carry chains for the multiplication has brought some speed improvement (up to
20 %), but it did not meet our expectations. Another negative aspect of carry
chains is their vendor-specific character. Nevertheless, we can conclude that
carry chains in Altera FPD remain useful, and we use them as often as possible
in our cipher implementations.
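The fan-in figures quoted above (5 or 7 inputs per MixColumn output bit, 11 to 19 per InvMixColumn output bit) can be reproduced by counting, for each output bit, how many input bits enter its XOR expression. The sketch below is an illustration under the assumption that both transformations use the standard Rijndael coefficient rows (02, 03, 01, 01) and (0E, 0B, 0D, 09); the helper names are not from the paper.

```python
def xtime(b):
    """Multiply a byte by 02 in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    b <<= 1
    return b ^ 0x11B if b & 0x100 else b


def fan_in_of_constant(c):
    """Inputs per output bit of the map X -> c * X (list indexed by bit 0..7)."""
    columns = []          # columns[j] = c * x^j, the image of input bit j
    value = c
    for _ in range(8):
        columns.append(value)
        value = xtime(value)
    return [sum((col >> i) & 1 for col in columns) for i in range(8)]


def fan_in_of_row(coefficients):
    """Inputs per output bit when four distinct input bytes are combined."""
    per_byte = [fan_in_of_constant(c) for c in coefficients]
    return [sum(counts[i] for counts in per_byte) for i in range(8)]


# All rows of each matrix are rotations of one another, so one row suffices.
mix_fan_in = fan_in_of_row([0x02, 0x03, 0x01, 0x01])
inv_mix_fan_in = fan_in_of_row([0x0E, 0x0B, 0x0D, 0x09])
```

With 4-input logic elements, a balanced XOR tree needs two levels for up to 16 inputs and three levels beyond that, which agrees with the level counts given in the text.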



3.2 Mapping Based on 8 × 32-Bit T-boxes

This approach was originally proposed for 32-bit processors [13]. From the point
of view of memory requirements, it is less attractive than the method based on
8 × 8-bit S-boxes, since in the worst case it uses 4 times more embedded
memory. This is a clear disadvantage of this approach. On the other hand, in
FPD with 4-kbit EABs it uses just 2 times more EABs. Since high-performance FPD
(e.g. APEX devices) include relatively large embedded memories, these FPD can
be used for mapping a fast cipher configuration based on the larger 8 × 32-bit
T-boxes. Features and advantages of an FPD implementation based on T-boxes are
described in this section.

The T-box approach combines the S-boxes and the MixColumn operation for the
encryption process into four 8 × 32-bit tables [13]



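$$
T_0[a] = \begin{bmatrix} S[a] \bullet 02 \\ S[a] \\ S[a] \\ S[a] \bullet 03 \end{bmatrix}
\quad
T_1[a] = \begin{bmatrix} S[a] \bullet 03 \\ S[a] \bullet 02 \\ S[a] \\ S[a] \end{bmatrix}
\quad
T_2[a] = \begin{bmatrix} S[a] \\ S[a] \bullet 03 \\ S[a] \bullet 02 \\ S[a] \end{bmatrix}
\quad
T_3[a] = \begin{bmatrix} S[a] \\ S[a] \\ S[a] \bullet 03 \\ S[a] \bullet 02 \end{bmatrix}
\tag{6}
$$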



These tables, each with 256 4-byte word entries, make up 4 Kbytes of total
space. Using these tables, the complete round transformation for a 32-bit block
can be expressed as [13]

$$c_j = T_0[a_{0,j}] \oplus T_1[a_{1,j-1}] \oplus T_2[a_{2,j-2}] \oplus T_3[a_{3,j-3}] \oplus K_j \tag{7}$$

where K_j is the round key in round j. Since the MixColumn operation is not
performed in the last round of the encryption algorithm, the last round has to
be handled specially: S-boxes have to be used instead of T-boxes. Fortunately,
S-boxes can easily be extracted from T-boxes: since all T_i[a], i = 0, 1, 2, 3,
boxes contain direct S[a] values in some rows, we can get the substitution
result by combining the T-box outputs of selected bytes (where the S-box output
value has not been multiplied by the constants 02 or 03).
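The construction of equations (6) and (7) can be illustrated in software. The following sketch is a behavioural model only, not the FPD implementation: it generates the S-box from its definition (inversion in GF(2^8) followed by the affine transform), builds the four T-boxes, and checks one output column against a direct ByteSub plus MixColumn computation. The four input bytes are assumed to be already selected according to ShiftRow, and all names are illustrative assumptions.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def build_sbox():
    """Rijndael S-box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gmul(x, y) == 1)
        s = 0x63
        for k in range(5):  # inv XOR its left rotations by 1, 2, 3 and 4 bits
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s)
    return sbox


SBOX = build_sbox()
# Each T-box entry is a 4-byte column, top row first, as in equation (6).
T0 = [(gmul(s, 2), s, s, gmul(s, 3)) for s in SBOX]
T1 = [(gmul(s, 3), gmul(s, 2), s, s) for s in SBOX]
T2 = [(s, gmul(s, 3), gmul(s, 2), s) for s in SBOX]
T3 = [(s, s, gmul(s, 3), gmul(s, 2)) for s in SBOX]


def xor_columns(*columns):
    """Byte-wise XOR of several 4-byte columns."""
    out = [0, 0, 0, 0]
    for col in columns:
        out = [o ^ c for o, c in zip(out, col)]
    return tuple(out)


def round_column_tbox(a0, a1, a2, a3, key_column):
    """One output column of a round: four table look-ups and XORs."""
    return xor_columns(T0[a0], T1[a1], T2[a2], T3[a3], key_column)


def round_column_direct(a0, a1, a2, a3, key_column):
    """Same column computed as ByteSub, MixColumn and round key addition."""
    s = [SBOX[a0], SBOX[a1], SBOX[a2], SBOX[a3]]
    matrix = ((2, 3, 1, 1), (1, 2, 3, 1), (1, 1, 2, 3), (3, 1, 1, 2))
    mixed = []
    for row in matrix:
        acc = 0
        for coefficient, byte in zip(row, s):
            acc ^= gmul(coefficient, byte)
        mixed.append(acc)
    return xor_columns(mixed, key_column)


def last_round_sbox(a):
    """Plain S[a] for the last round, read from an unmultiplied row of T0."""
    return T0[a][1]
```

The last helper shows the extraction mentioned above: rows 1 and 2 of T_0 hold S[a] unchanged, so no separate S-box memory is needed for the final round.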

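In order to use the T-box approach for decryption, the cipher structure
described in Fig. 1b has to be modified. This implementation aspect was
anticipated in the design of the Rijndael cipher [13]. The modified structure
of the decryption algorithm (see Fig. 4) is the same as the structure of the
encryption algorithm; therefore the T-box approach shown in Fig. 6 can also be
used directly for decryption, with the exception that a new set of inverse
T^{-1}-boxes must be used:

$$
T_0^{-1}[a] = \begin{bmatrix} S^{-1}[a] \bullet 0E \\ S^{-1}[a] \bullet 09 \\ S^{-1}[a] \bullet 0D \\ S^{-1}[a] \bullet 0B \end{bmatrix}
\quad
T_1^{-1}[a] = \begin{bmatrix} S^{-1}[a] \bullet 0B \\ S^{-1}[a] \bullet 0E \\ S^{-1}[a] \bullet 09 \\ S^{-1}[a] \bullet 0D \end{bmatrix}
\quad
T_2^{-1}[a] = \begin{bmatrix} S^{-1}[a] \bullet 0D \\ S^{-1}[a] \bullet 0B \\ S^{-1}[a] \bullet 0E \\ S^{-1}[a] \bullet 09 \end{bmatrix}
\quad
T_3^{-1}[a] = \begin{bmatrix} S^{-1}[a] \bullet 09 \\ S^{-1}[a] \bullet 0D \\ S^{-1}[a] \bullet 0B \\ S^{-1}[a] \bullet 0E \end{bmatrix}
\tag{8}
$$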

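Since no row of T_j^{-1}, j = 0, 1, 2, 3, contains unmodified S^{-1}[a] values,
extraction of S^{-1}[a] values from the T^{-1}-boxes must be done by
multiplying a selected row by the multiplicative inverse in GF(2^8) of the
corresponding constant (0E^{-1} = E5, 09^{-1} = 4F, 0D^{-1} = E1,
0B^{-1} = C0) according to the equations

$$S^{-1}[a] = 09^{-1} \bullet [S^{-1}[a] \bullet 09] = 4F \bullet [S^{-1}[a] \bullet 09] \tag{9}$$

and similarly

$$S^{-1}[a] = E5 \bullet [S^{-1}[a] \bullet 0E] = C0 \bullet [S^{-1}[a] \bullet 0B] \tag{10}$$

Multiplication by these constants in GF(2^8) can be represented using the
following bit-wise XOR operations

$$
S^{-1}[x] = E5 \bullet X: \quad
\begin{aligned}
s_7^{-1} &= x_7 \oplus x_5 \oplus x_4 \oplus x_2 \oplus x_1 \oplus x_0 \\
s_6^{-1} &= x_7 \oplus x_6 \oplus x_4 \oplus x_3 \oplus x_1 \oplus x_0 \\
s_5^{-1} &= x_6 \oplus x_5 \oplus x_3 \oplus x_2 \oplus x_0 \\
s_4^{-1} &= x_5 \oplus x_4 \oplus x_2 \oplus x_1 \\
s_3^{-1} &= x_7 \oplus x_5 \oplus x_3 \oplus x_2 \\
\boldsymbol{s_2^{-1}} &\boldsymbol{= x_6 \oplus x_5 \oplus x_0} \\
\boldsymbol{s_1^{-1}} &\boldsymbol{= x_7 \oplus x_5 \oplus x_4} \\
s_0^{-1} &= x_6 \oplus x_5 \oplus x_3 \oplus x_2 \oplus x_1 \oplus x_0
\end{aligned}
\tag{11}
$$

$$
S^{-1}[x] = 4F \bullet X: \quad
\begin{aligned}
\boldsymbol{s_7^{-1}} &\boldsymbol{= x_7 \oplus x_4 \oplus x_1} \\
\boldsymbol{s_6^{-1}} &\boldsymbol{= x_6 \oplus x_3 \oplus x_0} \\
\boldsymbol{s_5^{-1}} &\boldsymbol{= x_5 \oplus x_2} \\
\boldsymbol{s_4^{-1}} &\boldsymbol{= x_4 \oplus x_1} \\
s_3^{-1} &= x_7 \oplus x_4 \oplus x_3 \oplus x_1 \oplus x_0 \\
s_2^{-1} &= x_7 \oplus x_6 \oplus x_4 \oplus x_3 \oplus x_2 \oplus x_1 \oplus x_0 \\
s_1^{-1} &= x_6 \oplus x_5 \oplus x_3 \oplus x_2 \oplus x_1 \oplus x_0 \\
\boldsymbol{s_0^{-1}} &\boldsymbol{= x_5 \oplus x_2 \oplus x_0}
\end{aligned}
\tag{12}
$$

$$
S^{-1}[x] = C0 \bullet X: \quad
\begin{aligned}
s_7^{-1} &= x_7 \oplus x_6 \oplus x_4 \oplus x_1 \oplus x_0 \\
s_6^{-1} &= x_7 \oplus x_6 \oplus x_5 \oplus x_3 \oplus x_0 \\
s_5^{-1} &= x_6 \oplus x_5 \oplus x_4 \oplus x_2 \\
s_4^{-1} &= x_7 \oplus x_5 \oplus x_4 \oplus x_3 \oplus x_1 \\
\boldsymbol{s_3^{-1}} &\boldsymbol{= x_3 \oplus x_2 \oplus x_1} \\
s_2^{-1} &= x_7 \oplus x_6 \oplus x_4 \oplus x_2 \\
s_1^{-1} &= x_7 \oplus x_6 \oplus x_5 \oplus x_3 \oplus x_1 \\
s_0^{-1} &= x_7 \oplus x_5 \oplus x_2 \oplus x_1
\end{aligned}
\tag{13}
$$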



Since equations (11)-(13) all yield the same S^{-1}[x] value, each output bit
s_i^{-1}, i = 0, 1, ..., 7, of S^{-1}[x] can be computed as a combinatorial
logic function with at most 3 logical inputs (chosen from equations (11)-(13)
and typed in bold face) implemented within one LE. This function is implemented
in a "Multiplication elimination" block depicted in Fig. 6.
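The selection of the cheapest expression per output bit can be reproduced mechanically. The sketch below is an illustrative software model with assumed helper names, not the block of Fig. 6 itself. It derives the XOR taps of multiplication by E5, 4F and C0, picks for each output bit the row whose expression has the fewest inputs, and checks that the resulting logic recovers S^{-1}[a] from the T^{-1}-box rows.

```python
def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def xor_taps(c):
    """taps[i] lists the input bits j that feed output bit i of c * X."""
    columns = [gmul(c, 1 << j) for j in range(8)]  # image of each input bit
    return [[j for j in range(8) if (columns[j] >> i) & 1] for i in range(8)]


# Constant stored in a T^-1 row, mapped to its multiplicative inverse.
INVERSES = {0x0E: 0xE5, 0x09: 0x4F, 0x0B: 0xC0}


def cheapest_taps():
    """For each output bit, choose the row with the smallest XOR expression."""
    taps = {k: xor_taps(inv) for k, inv in INVERSES.items()}
    choice = []
    for i in range(8):
        best = min(INVERSES, key=lambda k: len(taps[k][i]))
        choice.append((best, taps[best][i]))
    return choice


def eliminate_multiplication(rows, choice):
    """Recover S^-1[a] from rows = {constant: S^-1[a] * constant}."""
    out = 0
    for i, (constant, taps) in enumerate(choice):
        bit = 0
        for j in taps:
            bit ^= (rows[constant] >> j) & 1
        out |= bit << i
    return out


CHOICE = cheapest_taps()
```

Under these assumptions every selected expression has at most three inputs, so each output bit fits one 4-input logic element.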



4 Results of Implementation in Altera FPD 

To map the Rijndael algorithm into Altera FPD, a VHDL-based design methodology
has been used. It should be stressed that all presented results have been
obtained using timing analysis and implementation reports generated by the
Altera MAX+PLUS II v.9.6 and QUARTUS v. 2000.5 development tools. The results
of the mapping based on 8 × 8-bit S-boxes are summarized in Tables 1-3 and the
results for 8 × 32-bit T-boxes in Tables 4-5.



Table 1. Fast configuration with 16 S-boxes and 128-bit data blocks in APEX
20KE200-1 (using 50 ESBs = 98% of total ESB count)

              Logic Elements Used                    Speed (Mbits/s)
    Encrypt        Decrypt          Both
    LE      %      LE      %      LE      %        Enc     Dec     Both
  1257     15    1738     21    2493     30        964     694      612

Fig. 6. Cipher architecture based on T-boxes approach



Table 2. Fast configuration with 16 S-boxes and 128-bit data blocks in FLEX
10KE200-1 (using 24 EABs = 100% of total EAB count)

              Logic Elements Used                    Speed (Mbits/s)
    Encrypt        Decrypt          Both
    LE      %      LE      %      LE      %        Enc     Dec     Both
  1265     12    1801     18    2530     25        570     505      451



Table 3. Economic configuration with 8 S-boxes and 64-bit data blocks in ACEX
1K100-1 (using 12 EABs = 100% of total EAB count)

              Logic Elements Used                    Speed (Mbits/s)
    Encrypt        Decrypt          Both
    LE      %      LE      %      LE      %        Enc     Dec     Both
  1461     29    2006     40    2923     59        282     238      212



Table 4. Fast configuration with 16 T-boxes and 128-bit data blocks in APEX
20KE400-1 (using 86 ESBs = 82% of total ESB count)

              Logic Elements Used                    Speed (Mbits/s)
    Encrypt        Decrypt          Both
    LE      %      LE      %      LE      %        Enc     Dec     Both
   529      3     529      3     845      5        750     750      750

Table 5. Economic configuration with 4 T-boxes and 32-bit data blocks in ACEX
1K50-1 (using 10 EABs = 100% of total EAB count)

              Logic Elements Used                    Speed (Mbits/s)
    Encrypt        Decrypt          Both
    LE      %      LE      %      LE      %        Enc     Dec     Both
   824     29     824     29    1213     42        115     115      115



5 Discussion 

5.1 Comparison of Two Methods of Rijndael Implementation 

Analyzing the results given in the previous section, we can list the following
advantages of the method using the S-box approach:

- lower memory requirements for S-box implementation,
- no latency when switching between encryption and decryption,
- very fast encryption, but significantly slower decryption.

As disadvantages of this method we can name:

- low level of resource sharing,
- high count of logic elements used.

The second method, based on T-boxes, brings the following advantages:

- faster overall cipher speed (for both encryption and decryption),
- high level of resource sharing, due to the symmetry of encryption/decryption,
- very few logic elements used, because matrix multiplication is realized using
  look-up tables.

The disadvantages of the second method are:

- relatively high latency when changing from encryption to decryption and vice
  versa: T-boxes have to be generated from S-boxes stored in one EMB (this
  latency can be reduced to zero if double the amount of EMBs is used),
- double (or quadruple) memory needs for T-box implementation (one T-box has
  8 kbits, while one S-box has 2 kbits).

We can conclude that the first method could be better for applications where
only the encryption algorithm is used. In contrast, the second method should
give better results if both encryption and decryption have to be fast. In the
economic version (where switching latency is acceptable) T-boxes can be
computed from S-boxes stored in one EMB after each change of direction. In the
fast version separate T-boxes can be used for encryption and decryption. This
reduces the switching latency to zero.

5.2 Comparison with Known FPD Implementations

Several Rijndael cipher implementations have been published up to now. Table 6
gives the FPD implementation results of encryption/decryption speed in the
feedback cipher mode published in [4] - ELB, [6] - DAN, [8] - GAJ and [10]
- MUT. For comparison, the NSA implementation in 0.5 µm ASIC [8] is included
as well. Figure 7 compares known implementations in low-cost Altera FPD. It
can be seen that our 16 S-box implementation is the fastest implementation of
the Rijndael cipher in low-cost Altera FPD. The T-box approach makes it
possible to implement the Rijndael cipher in a circuit as small as the ACEX
1K50, leaving almost 60 % of resources free! As can be seen in Fig. 8, the
encryption/decryption in our fast configuration based on the T-box
implementation is more than 80 % faster than the fastest FPD implementation
known to us. It can also be seen that the S-box approach gives similar results
for the comparable Altera FLEX and Xilinx VIRTEX families.



Table 6. Results of Rijndael implementations

  Implementation                         Speed (Mbits/s)
  Fast (T-boxes, 128 bit blocks)         750
  Fast (S-boxes, 128 bit blocks)         612
  NSA                                    606
  GAJ                                    414
  DAN                                    353
  ELB                                    300
  MUT                                    248
  Economic (S-boxes, 64 bit blocks)      212
  Economic (T-boxes, 32 bit blocks)      115



6 Conclusions 

In this paper we have evaluated the Rijndael cipher from the point of view of its
implementation in reconfigurable hardware. The implementation results given in
the previous sections depend significantly on the FPD family used. The Altera
APEX FPD have been found to be an excellent solution for very fast Rijndael
cipher implementation in reconfigurable hardware. The new solution presented,
based on T-boxes, allows the Rijndael cipher to be implemented with the same
high speed for encryption and decryption. On the other hand, the low-cost ACEX
FPD family is suitable for cost-sensitive encryption applications. Future
development will include integration of circuits for key exchange based on
public-key schemes. Although the current implementation uses only 128-bit keys,
extension to larger keys (192 and 256 bits) requires just minor algorithm
modifications and allows higher security to be reached with only minimal
additional development effort.



[Bar chart, throughput in Mbits/s. Implementations: 1) 16 S-boxes in FLEX
10K200E (TSI/TUKE), 451; 2) 16 S-boxes in FLEX 10K130E (GMU), 316; 3) 16
S-boxes in FLEX 10K250 (MUT), 248; 4) 8 S-boxes in ACEX 1K100E (TSI/TUKE);
5) 4 T-boxes in ACEX 1K50E (TSI/TUKE). TSI/TUKE - TSI France & Technical
University of Kosice, Slovakia; GMU - George Mason University, USA; MUT -
Military University of Technology, Poland.]

Fig. 7. Comparison of known Rijndael cipher implementations in Altera FLEX 10K
and ACEX 1K FPD



[Bar chart, throughput in Mbits/s. Implementations: 1) 16 T-boxes in 0.18 µm
APEX 20K400E (TSI/TUKE), 750; 2) 16 S-boxes in 0.18 µm APEX 20K200E
(TSI/TUKE), 612; 3) 16 S-boxes in 0.5 µm ASIC (NSA), 606; 4) 16 S-boxes in
0.22 µm FLEX 10K200E (TSI/TUKE), 451; 5) 16 S-boxes in 0.22 µm VIRTEX XCV1000
(GMU), 414. TSI/TUKE - TSI France & Technical University of Kosice, Slovakia;
NSA - National Security Agency, USA; GMU - George Mason University, USA.]

Fig. 8. Comparison of fastest known Rijndael cipher implementations in feedback mode
for different FPD and ASIC

Abstract. In this paper we explore pseudo-random number generation
on the IBM 4758 Secure Crypto Coprocessor. In particular we compare
several variants of Gennaro's provably secure generator, proposed
at Crypto 2000, with more standard techniques based on the SHA-1
compression function. Our results show how the presence of hardware support
for modular multiplication and exponentiation affects these algorithms.



1 Introduction 

The use of cryptographic techniques is a key element of modern e-business ap- 
plications. Such applications use cryptography in a variety of ways to protect 
the privacy and confidentiality of data, to ensure the integrity of data, and to 
provide user accountability through digital signature techniques. 

The security of cryptographic algorithms in real-life applications, however,
relies mostly on two main assumptions:

1. that the secret keys used in the algorithms have not been compromised, 

2. that the code executing the algorithms is really performing the tasks that it 
is supposed to. 

Thus, in real life there is a concrete need to address these issues: the physical 
security of the keys and the code used in cryptographic algorithms. This is 
why most of the time, the keys are stored in a secure, protected memory device 
which is not easily tampered with. Similarly the code must be run in a protected 
environment. One answer to these issues is to use a secure coprocessor. 

A secure coprocessor is a device that offloads computationally intensive cryp- 
tographic processes from the hosting server, and performs sensitive tasks unsuit- 
able for less secure general purpose computers. Depending on the applications, 
it may be a special-purpose computational engine (say a hardware RSA chip), or 
it may be more useful to have a general-purpose computing environment. Such 
a device must withstand physical and logical attacks; it must run the programs 
that it is supposed to, unmolested. The host server should be able to (remotely) 
distinguish between the real device and a possible impersonation. The coproces- 
sor must remain secure even if adversaries carry out destructive analysis of one 
or more devices.

An important class of secure coprocessors are the so-called field programmable
ones, which allow the user to write custom software for the device, which then 
loads it under some controlled condition and subsequently runs it. 

In this paper we consider the IBM 4758 PCI Secure Crypto Coprocessor, 
which is an example of such a field programmable device [12]. The IBM 4758 is 
the only programmable device on the market which has been certified at FIPS 
140-1 Level 4, the highest security classification for a commercial cryptographic 
device [8]. We elaborate more on the technical specifications of the IBM 4758 in 
Section 2. 

We report the implementation results, on the IBM 4758, of a random number 
generator recently proposed at CRYPTO’2000 [5]. 



1.1 The Problem of Pseudo-random Bit Generation 

Many, if not all, cryptographic algorithms rely on the availability of truly random 
bits. However perfect randomness is a scarce resource. Fortunately for almost all 
cryptographic applications, it is sufficient to use pseudo-random bits, i.e. sources 
of randomness that “look” sufficiently random to the adversary. 

This notion can be made more formal. The concept of cryptographically 
strong pseudo-random bit generators (PRBG) was introduced in papers by Blum 
and Micali [3] and Yao [14]. Informally a PRBG is cryptographically strong if it 
passes all polynomial-time statistical tests or, in other words, if the distribution 
of sequences output by the generator cannot be distinguished from truly random 
sequences by any polynomial-time judge. 

A PRBG is called provably secure if its security can be reduced to a
well-established conjectured hard problem (like factoring or computing discrete
logarithms).

The generator in [5] assumes a variation of the Discrete Log Assumption. More
specifically, it assumes that if solving the discrete log problem modulo an
n-bit prime p is hard even when the exponent is small (say only c bits long
with c < n), then the function f : {0, 1}^c → Z_p^* defined as
f(x) = g^x mod p has strong pseudo-randomness properties over Z_p^*. In
particular it is possible to think of it as a pseudo-random generator itself.
By iterating the above function and outputting the appropriate bits, an
efficient pseudo-random bit generator is obtained. The generator outputs
n − c − 1 bits per iteration, each of which consists of a single exponentiation
with a c-bit exponent.

An attractive feature of this generator is that all the exponentiations are 
computed over a fixed basis, and thus precomputation tables can be used to 
speed them up. 
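The idea of fixed-base precomputation can be illustrated with the simplest possible table, which stores the repeated squares of the base. This is only a sketch under stated assumptions: it is not the scheme of [7], which achieves fewer multiplications with a different table layout, and the function names and the toy modulus are illustrative choices.

```python
def build_table(g, p, exponent_bits):
    """Precompute table[i] = g^(2^i) mod p for a fixed base g."""
    table = [g % p]
    for _ in range(exponent_bits - 1):
        table.append(table[-1] * table[-1] % p)
    return table


def fixed_base_pow(table, x, p):
    """Compute g^x mod p using only multiplications by table entries.

    Requires 0 <= x < 2**len(table). No squarings are needed at run time;
    the number of multiplications equals the number of 1 bits in x.
    """
    result = 1
    i = 0
    while x:
        if x & 1:
            result = result * table[i] % p
        x >>= 1
        i += 1
    return result


# Toy usage with an arbitrary modulus and a 160-bit exponent.
P_EXAMPLE = 2**127 - 1
TABLE = build_table(3, P_EXAMPLE, 160)
X_EXAMPLE = (1 << 159) | 0x1234567
Y_EXAMPLE = fixed_base_pow(TABLE, X_EXAMPLE, P_EXAMPLE)
```

For a random c-bit exponent this simple table costs about c/2 multiplications and c table entries; more elaborate schemes trade additional table space for fewer multiplications.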

Using typical parameters n = 1024 and c = 160, we obtain roughly 860
pseudo-random bits per exponentiation with a 160-bit exponent. Using the
precomputation scheme proposed in [7], one can show that such an exponentiation
costs on average roughly 40 multiplications, using a table of only 12 Kbytes.
Thus we obtain a rate of more than 21 pseudo-random bits per modular
multiplication. Different tradeoffs between memory and efficiency can be
obtained.

1.2 Interesting Questions and Our Results

When we started this implementation project we had the following questions 
which we thought were worth investigating: 

- The IBM 4758, like many other crypto coprocessors, provides hardware support
  for modular math operations (modular multiplications and exponentiations).
  How effective are precomputation techniques like [7] in the presence of
  hardware support? Is the extra storage worth the potential gain in speed?

- The generator proposed in [5] is the fastest provably secure PRBG in the
  literature, based on established number theoretic conjectures. It would be
  interesting to know how it compares to other PRBGs whose security is assumed
  "from scratch" since they are related to block ciphers and hash functions.
  In particular it is interesting to see the results of this comparison in a
  constrained computing environment like a secure coprocessor.

For the first question, we ran the algorithm with various settings of the
precomputation scheme of [7], as well as with no precomputation at all. In the
latter case, modular exponentiations were computed completely in hardware,
while in the former case the dedicated hardware was invoked only for modular
multiplications. Quite surprisingly, we obtained timing results that showed no
increase in speed with the use of precomputation tables. In fact, the algorithm
was substantially slowed down. This seems to indicate that hardware support for
modular exponentiation totally eliminates the need for precomputation schemes.

For the second question, we ran the generator of [5] against an implementation
of a pseudo-random generator consistent with the ANSI X9.17 Key Management
standard*. This generator is based on repeated application of the hash function
SHA-1. The timing results show that the SHA-1 based generator is still
considerably more efficient than our number theoretic construction (but, as
mentioned above, this is at the cost of not being provably reducible to any
(supposedly) hard mathematical problem).



2 The IBM 4758 Architecture 

The IBM 4758 Secure Crypto Coprocessor is a hardware card that plugs into
industry-standard PCI slots in personal computers and other systems that
support the PCI bus. The Coprocessor secure processing environment contains a
486-compatible microprocessor, custom hardware to perform DES and public
key cryptographic algorithms, a secure clock/calendar, and a hardware random
number generator. See Figure 1 for a complete list of specifications.

* In fact it is the implementation that is used by the card itself for
pseudo-random number generation.

Features:
  Card type:              PCI 32-bit Bus Master
  Internal Processor:     486 99MHz
  RAM:                    4 Mbytes
  ROM/FLASH:              4 Mbytes
  Battery-backed RAM:     32 Kbytes
Crypto:
  DES:                    Hardware support
  RSA, DSS:               Software with hardware support for 1024-bit modular math
  Hashing:                SHA-1 in hardware
  Random numbers:         Noise-based hardware RNG

Fig. 1. Features of the IBM 4758

It also has protective shields, sensors, and control circuitry to protect against
a wide variety of attacks against the system. More specifically, the 4758 is
protected against attacks involving probe penetration, power sequencing,
radiation and temperature manipulation, consistent with the FIPS 140-1 Level 4
Certification. The basic element of the protective layer is a grid of
conductors which is monitored by circuitry that can detect changes in the
properties of the conductors. The conductors themselves are non-metallic and
closely resemble the material they are embedded in. This makes discovery,
isolation and manipulation all the more difficult. These grids are arranged in
several layers and the sensing circuitry can detect accidental connections
between layers as well as changes in an individual layer. The sensing grids are
made of flexible material and are wrapped around and attached to the secure
processor package as if it were being gift-wrapped. After the package is
wrapped, it is embedded in a potting material (which, as mentioned, closely
resembles the conductors). Finally the entire package is enclosed in a grounded
shield to reduce susceptibility to electromagnetic interference and to reduce
detectable electromagnetic emanations.

During the final manufacturing step, the Coprocessor generates a unique 
public key pair, which is stored in the device. The tamper detection circuitry 
is activated at this time and remains active throughout the useful life of the 
Coprocessor, protecting this private key, as well as all other keys and sensitive 
data. The Coprocessor public key is certified at the factory by a global IBM 
private key and the certificate is retained in the Coprocessor. Subsequently, the 
Coprocessor private key is used to sign the Coprocessor status responses which,
in conjunction with the public key certificate, demonstrate that the Coprocessor
remains intact and is genuine. 

From the time of manufacture, if the tamper sensors are ever triggered, the 
Coprocessor zeroizes its critical keys, destroys its certification, and is rendered 
inoperable. 



2.1 Developing Applications for the 4758 

The Coprocessor contains firmware to manage its specialized hardware and to
control loading of additional software. The card runs the IBM CP/Q embedded
operating system, which has been extended with device drivers and other features
specific to the Coprocessor. The resulting control program, CP/Q++, provides
the platform for application development. A complete custom application (like
our pseudorandom generator) can be built on the CP/Q++ environment.

During development, the security features of the 4758 and the public key
signatures used to validate download requests are irrelevant, but enabling
symbolic debugging capability by adding a debug probe to CP/Q++ is essential.
Preparing the 4758 for development is a one-time process. This step allows an
external party to identify a 4758 by means of the identification of the officer
assigned to the operating system layer. It is important in the overall picture
of security provided by a 4758, in that a card with debug capability cannot
masquerade as a secure card to the external world.

Application code is written in C, with one portion destined for the 4758 and 
the other its partner on the host machine. The 4758-based software is cross- 
compiled using supplied headers. After the normal link step, there are a few 
additional steps: 

1. translation from the host-native executable format to the format accepted by
   the CP/Q++ loader, as well as translating debug symbols to a format
   understood by the symbolic debugger supplied with the Toolkit

2. packing the translated executable into a disk image for the read-only file
   system within the 4758, used by CP/Q++

3. downloading the disk image 

The last, download, step is a bit lengthy in that the 4758 must be rebooted in 
order to open the hardware locks that protect flash to enable writing of code, 
and the hardware is tested each time the 4758 is reset (essential parts of the 
security architecture) . 

After development has completed, software can be deployed using any of the 
host platforms for which a device driver is available (includes AIX, OS/2, Linux, 
others). The development Toolkit is available for NT, with a version hosted in 
Linux to appear shortly. 



3 The New Pseudorandom Generator 

In this section we briefly recall the [5] generator. 

Number-Theoretic Preliminaries. Let p be a prime. We denote with n the 
binary length of p. It is well known that Z* = {x : 1 < x < p — 1} is a cyclic 
group under multiplication modp. Let ghe a generator of Z*. Thus the function 

/ : ^ z; 

f{x) = modp 

is a permutation. The inverse of / (called the discrete logarithm function) is 
conjectured to be a function hard to compute (the cryptographic relevance of 
this conjecture first appears in the seminal paper by Diflie and Heilman [4] 
on public- key cryptography). The best known algorithm to compute discrete 
logarithms is the so-called index calculus method [1] which however runs in time 
sub-exponential in n. 
In some applications (like the one we are going to describe in this paper)
it is important to speed up the computation of the function $f(x) = g^x$. One
possible way to do this is to restrict its input to small values of x. Let c be an
integer which we can think of as depending on n ($c = c(n)$). Assume now that we
are given $y = g^x \bmod p$ with $x < 2^c$. It appears reasonable to assume that
computing the discrete logarithm of y is still hard even if we know that $x < 2^c$.
Indeed the running time of the index-calculus method depends only on the size
n of the whole group. Depending on the size of c, different methods may actually
be more efficient. Indeed the so-called baby-step giant-step algorithm by Shanks
[6] or the rho and lambda algorithms by Pollard [10] can compute the discrete
log of y in $O(2^{c/2})$ time. If one restricts the field to generic algorithms (i.e.
algorithms that can only perform group operations and cannot take advantage
of specific properties of the encoding of group elements) then Schnorr in [11]
proves that this is the best that can be done.
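
To make the $O(2^{c/2})$ bound concrete, the following is a minimal sketch of the baby-step giant-step idea restricted to short exponents. The toy safe prime p = 1019, the generator g = 2 and the function name `short_dlog` are illustrative assumptions and do not come from the paper; realistic parameters are far larger.

```python
def short_dlog(g, y, p, c):
    """Search for x with g**x = y (mod p) among 0 <= x < m*m, where m*m >= 2**c.

    Uses about 2**(c/2) table entries and 2**(c/2) group multiplications.
    Returns None if no exponent in that range matches.
    """
    m = 1 << ((c + 1) // 2)           # m = 2**ceil(c/2)
    baby = {}                          # maps g**j mod p to j, for 0 <= j < m
    e = 1
    for j in range(m):
        baby.setdefault(e, j)
        e = e * g % p
    # g**(-m) mod p via Fermat's little theorem (p is assumed prime)
    giant = pow(pow(g, m, p), p - 2, p)
    gamma = y % p
    for i in range(m):
        if gamma in baby:              # y * g**(-i*m) == g**j, so x = i*m + j
            return i * m + baby[gamma]
        gamma = gamma * giant % p
    return None

# Toy values (assumed): p = 1019 is a safe prime because 509 is prime,
# and g = 2 generates the whole group.
P, G, C = 1019, 2, 8
secret = 201
assert short_dlog(G, pow(G, secret, P), P, C) == secret
```

With c = 8 the search touches at most 16 baby steps and 16 giant steps instead of up to 256 trial exponents, which is the square-root saving the text refers to.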

If the complete factorization of $p - 1$ is known, then the running time of these
algorithms can be improved by using the Pohlig-Hellman decomposition [9]. This
is done by reducing the original discrete log problem to several “smaller”
problems (one for each distinct prime factor of $p - 1$).

Van Oorschot and Wiener in [13] present a new method of combining the
Pollard lambda method with a partial Pohlig-Hellman decomposition. Their end
result is that for random primes, using short exponents is not secure. However,
their attack can be avoided by restricting the moduli to be safe primes p (i.e. such
that $(p-1)/2$ is also a prime) since in this case the Pohlig-Hellman decomposition is
useless.

Thus if we set $c = \omega(\log n)$, there are no known polynomial time algorithms
that can compute the discrete log of $y = g^x \bmod p$ when $x < 2^c$ and p is a
safe prime. One can explicitly assume that no such efficient algorithm
exists. This is called the Discrete Logarithm with Short c-Bit Exponents (c-DLSE)
Assumption and we will adopt it as the basis of our results as well.



Assumption 1 (c-DLSE) Let SPRIMES(n) be the set of n-bit safe primes
and let c be a quantity that grows faster than log n (i.e. $c = \omega(\log n)$). For every
probabilistic polynomial time Turing machine I, for every polynomial $P(\cdot)$ and
for sufficiently large n we have that

$\Pr[\, p \leftarrow SPRIMES(n);\ x \leftarrow R_c;\ I(p, g, g^x, c) = x \,] \le \frac{1}{P(n)}$



In practice, given today’s computing power and discrete-log computing algorithms,
it seems to be sufficient to set n = 1024 and c = 160. This implies a
“security level” of $2^{80}$ (intended as work needed in order to “break” 160-DLSE).



3.1 The Algorithm 

Consider the following function: 

$RG_{n,c} : Z_{p-1} \to Z_p^*; \quad RG_{n,c}(s) = g^{\hat{s}} \bmod p$

Here $\hat{s}$ is s after zeroing the bits in positions $2, \ldots, n-c$. That is, we
consider modular exponentiation in $Z_p^*$ with base g in which these bits of the
input s are basically ignored.

The function RG induces a distribution over $Z_p^*$ in the usual way. We denote
with $RG_{n,c}$ the following probability distribution over $Z_p^*$:

$Prob_{RG_{n,c}}[y] = Prob[\, y = RG_{n,c}(s) ;\ s \leftarrow Z_{p-1} \,]$

It is possible to prove (see [5]) that the distribution $RG_{n,c}$ is computationally
indistinguishable from the uniform distribution over $Z_p^*$ if the c-DLSE assumption
holds.

It is now straightforward to construct the new generator. The algorithm
receives as a seed a random element s in $Z_{p-1}$ and then iterates the function
RG on it. The pseudo-random bits output by the generator are the bits ignored
by the function RG. The output of the function RG serves as the new input
for the next iteration.

In more detail, the algorithm $IRG_{n,c}$ (for Iterated-RG generator) works as
follows. Start with $x^{(0)} \in_R Z_{p-1}$. Set $x^{(i)} = RG_{n,c}(x^{(i-1)})$. Set also
$r^{(i)} = x^{(i)}_2, x^{(i)}_3, \ldots, x^{(i)}_{n-c}$. The output of the generator will be
$r^{(0)}, r^{(1)}, \ldots, r^{(k)}$, where
k is the number of iterations (chosen such that $k = poly(n)$ and $k(n-c-1) > n$).

Notice that this generator outputs $n - c - 1$ pseudo-random bits at the cost
of a modular exponentiation with a random c-bit exponent (i.e. the cost of the
computation of the function RG).
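
The iteration described above can be sketched as follows. This is only an illustration under stated assumptions: bit positions are counted from 1 at the least significant bit, the function names are invented here, and the toy parameters (p = 1019, g = 2, c = 4) are far too small to offer any security.

```python
def ignored_mask(n, c):
    """Mask of bit positions 2..n-c (position 1 is taken to be the least significant bit)."""
    return ((1 << (n - c)) - 1) & ~1

def rg(s, g, p, n, c):
    """RG_{n,c}(s): exponentiate g after zeroing bits 2..n-c of s."""
    return pow(g, s & ~ignored_mask(n, c), p)

def irg(seed, g, p, n, c, k):
    """Iterate RG k times; each step emits the n-c-1 bits that RG ignores."""
    mask = ignored_mask(n, c)
    x = seed
    chunks = []
    for _ in range(k):
        chunks.append((x & mask) >> 1)   # bits 2..n-c of the current input
        x = rg(x, g, p, n, c)            # the output of RG is the next input
    return chunks

# Toy parameters (assumptions): p = 1019 is a 10-bit safe prime, g = 2, c = 4.
P, G, N, C = 1019, 2, 10, 4
out = irg(seed=777, g=G, p=P, n=N, c=C, k=5)
print(out)   # five chunks of n - c - 1 = 5 bits each
```

Because the masked exponent depends only on the c most significant bits and the lowest bit, each call to `rg` costs one exponentiation with an exponent that is effectively c bits long, while n - c - 1 output bits are obtained per call.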

3.2 Efficiency Analysis 

Let’s fix n = 1024 and c = 160. With these parameters we can safely assume 
that the complexity of the best known algorithms to break c-DLSE is beyond 
the reach of today’s computing capabilities. 

We obtain 863 bits at the cost of roughly 240 multiplications, which yields
a rate of about 3.5 bits per modular multiplication. The most expensive part
of the computation of our generator is to compute $g^s \bmod p$ where s is a c-bit
value.

We can take advantage of the fact that the modular exponentiations are all
computed over the same basis g. This feature allows us to precompute powers
of g and store them in a table, and then use these values to quickly compute $g^s$ for
any s.

Lim and Lee [7] present flexible trade-offs between memory and computation
time to compute exponentiations over a fixed basis. Their approach is applicable
to our scheme as well. In short, the [7] precomputation scheme is governed by
two parameters h, v. The storage requirement is $(2^h - 1)v$ elements of the field.
The number of multiplications required to exponentiate to a c-bit exponent is
$\lceil c/h \rceil + \lceil c/(hv) \rceil - 2$ in the worst case.

Using the choice of parameters for 160-bit exponents suggested in [7] we can
get roughly 40 multiplications with a table of only 12 Kbytes. This yields a
rate of more than 21 pseudo-random bits per multiplication. A large memory
implementation (300 Kbytes) will yield a rate of roughly 43 pseudo-random bits
per multiplication.
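
The storage and multiplication counts for the precomputation scheme can be tabulated with a few lines of code. This is a sketch under the assumption that each field element occupies n = 1024 bits (128 bytes); the function name is invented for this illustration.

```python
def lim_lee_cost(c, h, v, element_bits=1024):
    """Return (storage_bytes, worst_case_multiplications) for parameters (h, v)."""
    elements = (2**h - 1) * v                      # table size in field elements
    storage_bytes = elements * element_bits // 8
    ceil_c_h = -(-c // h)                          # integer ceiling of c/h
    ceil_c_hv = -(-c // (h * v))                   # integer ceiling of c/(h*v)
    return storage_bytes, ceil_c_h + ceil_c_hv - 2

for h, v in [(8, 2), (10, 4), (10, 8)]:
    storage, mults = lim_lee_cost(160, h, v)
    print((h, v), storage / 1024, "Kbytes,", mults, "multiplications")
```

Under these assumptions the settings (8,2), (10,4) and (10,8) need just under 64 Kbytes, 512 Kbytes and 1 Mbyte of table space respectively.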



4 Implementation Timing Results 

We ran a C implementation of the above generator on the IBM 4758 card using 
the implementation procedures described in Section 2.1. In particular this means 
that we used the 4758 native hardware support for modular exponentiations and 
modular multiplications. 

We first ran 1024 iterations of the generator (i.e. an output of 863 Kbits)
without using precomputation tables. The task took approximately 4.75 seconds,
which implies a rate of 22.7 Kbytes/sec. Thus, for example, this is the
rate at which two secure coprocessors can securely encrypt data (via symmetric
encryption) under a strong mathematical guarantee of security.

We then ran the algorithm using the [7] precomputation scheme with various
settings of the parameters h, v described above. The experimental results
confirmed the theoretical speed-ups between different choices of the parameters;
however, they also demonstrated a major slowdown of the algorithm compared
to the case in which we computed the whole exponentiation in hardware.

The explanation is that the overhead of repeatedly invoking the hardware
chip for modular multiplication from software offsets whatever gain we could
obtain by decreasing the number of multiplications through precomputation
tables.

The results are summarized in Figure 2. 



(h,v)               Storage (Kbytes)   Time (sec)
no precomputation   -                  4.75
[illegible]         32                 79.33
(8,2)               64                 68.4
[illegible]         160                [illegible]
[illegible]         320                50.82
(10,4)              512                46.41
(10,8)              1 Mbyte            42.57

Fig. 2. Timing Results 

These can be compared to the SHA-1 based implementation, which took 
1.22 seconds to produce a similar 863 Kbit block of pseudo-random data. This 
implementation is written in highly optimised C code; in fact this is the code 
that the CP/Q operating system itself uses to generate pseudo-random data. 
However we do note that we ran the code as a standard “loaded-in” application, 
just as the number theoretic generator was, to enable a fair comparison. 

Another useful comparison is to the BBS generator (see [2]), where one obtains
at least 1 bit (and at most about 4 bits; this has to do with assumptions on
the hardness of factoring, see [5] for more details) of pseudorandom data from
each modular squaring. Theoretically this should be similar to our exponentiation
method (see [5] for a more rigorous comparison); however, the overhead
of calling the modular math hardware adversely affects this generator. In fact
it takes 2.3 seconds to generate just a 1 Kbit block of data using this approach
(assuming one bit per exponentiation).
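
The timing figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that 1 Kbit means 1024 bits and 1 Kbyte means 1024 bytes; the variable names are chosen here for illustration.

```python
BITS_PER_ITERATION = 1024 - 160 - 1        # n - c - 1 = 863
ITERATIONS = 1024

block_bits = BITS_PER_ITERATION * ITERATIONS        # the "863 Kbit" block
block_kbytes = block_bits / 8 / 1024

dl_time, sha1_time = 4.75, 1.22            # seconds per 863 Kbit block
bbs_time_per_kbit = 2.3                    # seconds per 1 Kbit block

dl_rate = block_kbytes / dl_time                     # Kbytes/sec
sha1_speedup = dl_time / sha1_time                   # SHA-1 versus discrete log
bbs_block_time = bbs_time_per_kbit * (block_bits / 1024)
bbs_slowdown_1bit = bbs_block_time / dl_time         # 1 bit per squaring
bbs_slowdown_4bit = bbs_slowdown_1bit / 4            # 4 bits per squaring

print(f"discrete-log generator: {dl_rate:.1f} Kbytes/sec")
print(f"SHA-1 is faster by a factor of {sha1_speedup:.2f}")
print(f"BBS is slower by a factor of {bbs_slowdown_4bit:.0f} to {bbs_slowdown_1bit:.0f}")
```

The computed rate matches the 22.7 Kbytes/sec quoted above, the SHA-1 advantage stays below a factor of 4, and the BBS slowdown falls roughly between 100 and 400 depending on how many bits are extracted per squaring.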

5 Conclusions 

The results show that SHA-1 based pseudorandom number generation is
still considerably faster than generation based on discrete logarithms. However, the
difference, a factor of less than 4 on this hardware, may be considered not too
high a price to pay by those who wish to have a “provably secure”, rather than
a “seemingly secure” (i.e. one that has withstood cryptographic attack thus far),
system for pseudorandom number generation.

It should be stressed however that this result is strongly reliant on the fact 
that the algorithms were tested on the IBM 4758 secure coprocessor, which 
has support for hardware modular exponentiation. All of the software-based 
exponentiation variants of [5] that we tried were considerably slower (another 
factor of 10 to 20), even though they made use of hardware support for modular 
multiplication, and used precomputed tables. 

The discrepancy was even more significant with the BBS generator due to
the low output rate of the generator for each call to the modular math hardware;
it turned out to be between 100 and 400 times slower than the “pure
exponentiation” generator on this hardware.
Abstract. General problems and difficulties are discussed which have
to be considered when testing true random numbers. Requirements are
formulated which appropriate online tests should fulfill. Then we propose
an online test procedure which meets these requirements.

Keywords: True random number generator, statistical test, online test. 



1 Introduction 

Random numbers play an important role in many cryptographic applications.
Random numbers are used, for instance, to generate random session keys, signature
parameters and challenges for challenge-response protocols and zero knowledge
proofs. Roughly speaking, the class of random number generators can be
divided into three subclasses. First, there are true (physical) random number
generators. Usually, an analog signal generated by a physical noise source is digitalized
at uniform time intervals, e.g. by a comparator. Many true random number
generators use a mathematical follow-up treatment, i.e. an algorithm applied to
the digitalized analog signals. The goal of a mathematical follow-up treatment
is to reduce or at least to mask weaknesses of the digitalized analog signals.
In contrast, pseudorandom number generators derive (pseudo-)random numbers
deterministically from a randomly chosen seed. Pseudorandom number generators
are very cheap as they merely require some additional lines of code. Their
drawback is that the whole entropy is “contained” in the seed. Finally, there
are “mixed” generators which derive random numbers from user interaction
(mouse movement or key strokes) or register values of the PC used.

In the following we restrict our attention to true random number generators
(TRNGs). We denote the digitalized analog signals as das-random numbers and
the values obtained after the mathematical follow-up treatment has been applied as
internal random numbers. Upon an external call the TRNG outputs internal
random numbers.

Obviously, inappropriate random numbers can weaken strong cryptographic
mechanisms considerably. To assess a TRNG, a mathematical model of the physical
noise source should be evaluated and analyzed, and suitable (that is, suitable
with respect to the mathematical model) statistical tests should be applied to
the das-random numbers generated by some TRNG prototypes. A lot of research
work has been devoted to the generation of good physical noise sources
and the determination of suitable statistical tests ([2], [10], [12] etc.).

Tolerances of the components of the random noise sources may cause
a particular TRNG to produce worse das-random numbers than the carefully
investigated prototypes. Further, ageing of these components may also affect the
statistical quality of the generated das-random numbers. As a consequence,
statistical tests (“online tests”) have to be executed while the TRNG is in operation
to ensure that the generated random numbers are appropriate. In particular, if the
TRNG is integrated in a smart card, online tests should run fast and require only
a few lines of additional code and little memory.

Section 2 considers the question whether the das-random numbers or the
internal random numbers should be tested, and in Sect. 3 general demands are
formulated which online tests should fulfill. In Sect. 4 we briefly discuss the
drawbacks of a widely used online test. In Sects. 5 - 9 a new online test procedure
is described, analyzed and illustrated by two examples. The paper ends with final
remarks.

2 Which Random Numbers Should Be Tested? 

If there is no mathematical follow-up treatment the das-random numbers
coincide with the internal random numbers. Otherwise, the online tests can be
applied to the das-random numbers or, alternatively, to the internal random
numbers.

For many cryptographic applications it is essential that random numbers
cannot be determined or guessed with a reasonable probability, even if
predecessors or successors are known. Pseudorandom number generators rely on the
complexity of their algorithms, which shall ensure practical security (see, e.g. [1]).
For TRNGs the situation is much more comfortable as the total entropy of a
das-random number sequence increases with each generated das-random number. If the
increase of entropy is sufficiently large, this ensures theoretical security. (Clearly,
a lucky attacker could guess a randomly chosen session key, for instance, but if
the key length is sufficiently large his success probability is negligible.)

Hence it is desirable to control the increase of entropy. Unfortunately, entropy
is not a function of random numbers but of random variables. In the following
we will interpret das-random numbers as values assumed by random variables
whose distribution is usually not known exactly. We use statistical tests
to compare das-random number sequences with sequences generated by ideal
random number generators.

Remark 1. In [3] a variant of Maurer’s “universal” statistical test (cf. [11], [4]) is
introduced. Its test value is closely related to the entropy per bit block provided
that the bits were generated by a stationary binary random source with finite
memory. If this is not the case, however, as for Maurer’s test the test value need
not yield a reliable estimate of the entropy. For pseudorandom bits generated
by a linear feedback shift register, for example, the increase of entropy per bit
equals zero whereas the test value “suggests” a considerable amount of entropy
per bit block. Moreover, like Maurer’s test it requires a lot of memory and gigantic
sample sizes. Hence both tests are not suitable as online tests but may be used
for the investigation of TRNG prototypes.

Example 1. Let the TRNG produce binary das-random numbers and let a linear
feedback shift register of length 63 with primitive feedback polynomial be synchronized
with the digitalization of the analog noise signal. In each time step the
feedback shift register delivers an internal random number (a single bit). The
das-random number generated in that step is XORed with the feedback value and the
sum is fed back into the shift register.

For each initial value of the feedback shift register this mathematical follow-up
treatment is a one-to-one mapping and thus cannot increase the average
entropy per bit. Statistical weaknesses of the das-random numbers are not reduced
but only transformed into others. If, for example, the das-random numbers
are independent but not equidistributed (i.e., if the probability for “0” is not
0.5) the internal random numbers are essentially equidistributed but dependent.
Unless its linear complexity profile is tested, applying statistical tests to the internal
random number sequence will presumably not even detect the worst case,
i.e. that the physical noise source has totally broken down. In fact, from this moment
on the das-random numbers are constant and the internal random numbers
are generated deterministically.
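
The masking effect of Example 1 can be reproduced on a small scale. The sketch below is an illustration only: it replaces the 63-bit register by a 4-bit register with the primitive feedback polynomial $x^4 + x + 1$ (an assumption made to keep the output short), and the function name is invented here.

```python
def lfsr_treatment(das_bits, state):
    """Feed das bits into a 4-bit LFSR with recurrence s[t+4] = s[t+1] ^ s[t] ^ das[t].

    Returns the internal random numbers, one output bit per das bit.
    """
    state = list(state)
    internal = []
    for d in das_bits:
        internal.append(state[0])               # internal random number
        feedback = state[0] ^ state[1] ^ d      # das bit XORed with the feedback value
        state = state[1:] + [feedback]
    return internal

# Total breakdown of the noise source: the das bits are constantly 0.
broken = [0] * 15
internal = lfsr_treatment(broken, state=[1, 0, 0, 0])
print(internal, sum(internal))   # 8 ones among 15 bits
```

Although the input carries no entropy at all, the internal bits over one period contain 8 ones and 7 zeros, so a simple frequency test applied to them would not notice the breakdown.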

This brief analysis of Example 1 has revealed an important fact: The internal
random numbers may pass certain statistical tests which the das-random numbers
do not. However, this does not necessarily imply that the mathematical
follow-up treatment reduces statistical weaknesses of the das-random numbers.
Maybe they are merely masked and transformed into others. An increase of the
entropy per bit can only be achieved by a data compression which in turn lowers the
bit rate. (In the simplest case non-overlapping bits are XORed.) Of course, the
das-random numbers may not be equidistributed and there may exist dependencies
on preceding das-random numbers, but in contrast to the internal random
numbers there will not exist complicated algebraic dependencies. Consequently,
the das-random numbers but not the internal random numbers should be tested,
especially if the TRNG is used for sensitive applications.
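
The simplest compressing treatment mentioned above, XORing non-overlapping bits, can be sketched as follows. The example assumes independent das bits with an illustrative bias of Prob(1) = 0.49; the function names are chosen here and are not taken from the paper.

```python
def xor_pairs(bits):
    """Compress a bit sequence by XORing non-overlapping pairs (halves the bit rate)."""
    return [bits[i] ^ bits[i + 1] for i in range(0, len(bits) - 1, 2)]

def xor_pair_prob_one(p_one):
    """P(output = 1) when two independent bits with P(1) = p_one are XORed."""
    return 2 * p_one * (1 - p_one)

print(xor_pairs([1, 0, 1, 1, 0, 0]))   # [1, 0, 0]
print(xor_pair_prob_one(0.49))         # close to 0.4998
```

Under the independence assumption the bias shrinks from 0.01 to about 0.0002, at the price of halving the output rate.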

3 Which Requirements Should Online Tests Fulfill? 

As motivated in the previous section, online tests should be applied to das-random
numbers which usually (but not necessarily) are single bits.

Definition 1. A realization of a random variable X is a value assumed by
X. The term iid abbreviates “independent and identically distributed”. We
call a random variable binary if it only assumes the values 0 and 1. A random
variable X is called equidistributed on a finite set $\Omega := \{\omega_1, \ldots, \omega_k\}$ if
$Prob(X = \omega_j) = 1/k$ holds for all $j \le k$. Applying a statistical test to a sample
delivers a numerical value called test value or test statistic.
Mathematical Model and Definitions. Das-random numbers assume values
in $\Omega_{das}$ (usually, but not necessarily, $\Omega_{das} = \{0,1\}$). We assume that the
das-random numbers are realizations of random variables $B_1, B_2, \ldots$ so that the
test value itself may be interpreted as a realization of a random variable T,
the so-called test variable. The distribution of $B_1, B_2, \ldots$ and that of T clearly
depend on the particular TRNG. For an ideal random number generator (a fiction!),
of course, the random variables $B_1, B_2, \ldots$ are iid and equidistributed on
$\Omega_{das}$. To avoid clumsy formulations we call bit sequences generated by an ideal
random number generator ideal sequences. A $\chi^2$-distribution with k degrees of
freedom is denoted with $\chi^2_k$.

We will use statistical tests to compare das-random numbers with ideal sequences.
To a certain degree, however, statistical deviations of the das-bits from ideal
sequences are tolerable. (Assume, for example, that the das-bits were
realizations of iid random variables $B_1, B_2, \ldots$ with $Prob(B_j = 1) = 0.49$. Then
the average entropy per das-bit would be about 0.9997.) Clearly, “tolerable” essentially
depends on the intended applications for which the random numbers shall
be used and, to a certain degree, on the mathematical follow-up treatment which
may increase the entropy per random number (cf. Sect. 2). If the statistical properties
of the das-random number sequence deviate too much from those of ideal
sequences the online test should generate a noise alarm. The preceding considerations
suggest various requirements which online tests should fulfill. Recall that
due to tolerances of components or ageing effects a TRNG may produce worse
das-random number sequences than the carefully investigated TRNG prototypes.
Clearly, ideal sequences would also occasionally fail statistical tests.
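
The entropy figure quoted in the parenthetical example can be checked directly with the Shannon entropy of a biased bit. The helper name below is chosen for this sketch.

```python
import math

def binary_entropy(p_one):
    """Shannon entropy (in bits) of a binary random variable with P(1) = p_one."""
    if p_one in (0.0, 1.0):
        return 0.0
    q = 1.0 - p_one
    return -(p_one * math.log2(p_one) + q * math.log2(q))

print(round(binary_entropy(0.49), 4))   # 0.9997
```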

Requirements for online tests.

(R1) An online test has to detect a total breakdown of the noise source very soon.

(R2) An online test should detect non-tolerable statistical weaknesses of the
das-random numbers.

(R3) The probability for a noise alarm should be small if the deviation of the
statistical properties of the das-random numbers from those of ideal sequences is
tolerable.

(R4) An online test should run fast and require only a few lines of code and little
memory.

4 Drawbacks of a Widely Used Online Test 
and General Difficulties 

In this section we briefly discuss problems of an obvious online test procedure
which is widely used in practice.

Example 2. Let the TRNG generate binary das-random numbers. A FIFO stores
internal random numbers. If the FIFO is full the generated internal random
numbers are neither used nor stored. Upon external request, the FIFO outputs
internal random numbers. Periodically every minute and whenever the FIFO has
to be refilled, n = 320 consecutive das-bits are segmented into 80 non-overlapping
four-bit blocks $W_1 := (B_1, \ldots, B_4), \ldots, W_{80} := (B_{317}, \ldots, B_{320})$. On each sample
a $\chi^2$-test (or more precisely, a $\chi^2$-test for goodness of fit ([7], 69)) is applied. For
this, the $W_j$ are interpreted as binary representations of four-digit numbers. For
$i = 0, \ldots, 15$ the frequencies $fr(i) := |\{j \le 80 \mid W_j = i\}|$ are determined and
finally

$T := \sum_{i=0}^{15} \frac{(fr(i) - 5)^2}{5}$    (1)



The null hypothesis, i.e. that the tested sample was generated by an ideal noise 
source, is rejected if the test value T exceeds 65.0. A rejection of the null hy- 
pothesis causes a noise alarm which puts the TRNG out of service. The TRNG 
has to pass extensive investigations before it can manually be restarted by an 
authorized person. 
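
The test value of Example 2 can be computed as in the following sketch. The function name and the representation of the das bits as a Python list of 0/1 integers are assumptions made for illustration.

```python
def chi_square_4bit(das_bits):
    """Test value T of equation (1) for n = 320 das bits (80 four-bit blocks)."""
    if len(das_bits) != 320:
        raise ValueError("expected exactly 320 das bits")
    freq = [0] * 16
    for j in range(0, 320, 4):
        b = das_bits[j:j + 4]
        value = b[0] * 8 + b[1] * 4 + b[2] * 2 + b[3]   # block read as a number 0..15
        freq[value] += 1
    return sum((f - 5) ** 2 for f in freq) / 5

REJECTION_BOUND = 65.0   # rejection boundary from Example 2
stuck_at_zero = [0] * 320
print(chi_square_4bit(stuck_at_zero) > REJECTION_BOUND)   # True: noise alarm
```

A source stuck at zero yields T = 1200 and is rejected at once, whereas a sample in which every four-bit value occurs exactly five times yields T = 0.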

The system administrator has laid down that on average no more
than 0.027 noise alarms per TRNG and year should occur if the TRNG produces
tolerable das-random numbers. (The numerical value 0.027 clearly depends on
the concrete application. For other applications, however, smaller values may be
appropriate.) To reach this goal the designer of this online test has chosen the
rejection area $(65.0, \infty)$. His considerations were the following: It is reasonable
to assume that each TRNG executes about 530000 $\chi^2$-tests per year. If the
das-random numbers were generated by an ideal random number generator the
test variable T would be approximately $\chi^2_{15}$-distributed ([7], 69) and thus
$Prob(T > 65.0) \approx 3.4 \cdot 10^{-8}$. This yields an expected number of 0.018 noise alarms per year.
As the das-random numbers generated by the investigated TRNG prototypes did
not reveal serious statistical weaknesses the online test designer expects that the
average number of noise alarms per TRNG and year will not exceed the given
upper bound 0.027 = 1.5 · 0.018.
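
The designer's estimate can be reproduced numerically. The sketch below uses the standard closed form of the chi-square tail for an odd number of degrees of freedom, so it needs only the Python standard library; the function name is an assumption of this illustration.

```python
import math

def chi2_tail_odd(x, dof):
    """P(X > x) for a chi-square variable X with an odd number of degrees of freedom.

    Closed form: erfc(sqrt(x/2)) + 2*phi(sqrt(x)) * sum over r = 1..(dof-1)/2 of
    sqrt(x)**(2r-1) / (1*3*...*(2r-1)), where phi is the standard normal density.
    """
    if dof % 2 != 1:
        raise ValueError("this sketch only handles odd degrees of freedom")
    chi = math.sqrt(x)
    phi = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    total, term = 0.0, chi                    # the term for r = 1 is chi / 1
    for r in range(1, (dof - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)               # next term: chi**(2r+1) / (1*3*...*(2r+1))
    return math.erfc(chi / math.sqrt(2)) + 2 * phi * total

p_alarm = chi2_tail_odd(65.0, 15)
tests_per_year = 530000
print(p_alarm, tests_per_year * p_alarm)
```

The result is close to $3.4 \cdot 10^{-8}$ per test and about 0.018 alarms per year, in agreement with the designer's figures for the approximating $\chi^2_{15}$-distribution.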

However, this argumentation is not quite correct. Even for ideal sequences
the test variable T is not exactly but merely approximately $\chi^2_{15}$-distributed.
In fact, the 4-tuples are multinomially distributed and the exact probability
$Prob(T > 65.0)$ is about $3.8 \cdot 10^{-7}$. If X denotes a $\chi^2_{15}$-distributed random
variable the absolute error $|Prob(T > 65.0) - Prob(X > 65.0)|$ is surely small
but not the relative error

$\frac{|Prob(T > 65.0) - Prob(X > 65.0)|}{Prob(X > 65.0)} \approx 10.1$    (2)



(Note that we use $Prob(X > 65.0)$ as denominator but not $Prob(T > 65.0)$
because the predictions of the test designer are based on the approximate
probability.) For the scenario described in Example 2 this means that even ideal
random number generators would cause about $11 \cdot 0.018 \approx 0.2$ noise alarms per
year on average; real TRNGs possibly even more. In any case, this exceeds the upper
bound 0.027 considerably. To avoid this, of course, the test parameter 65.0 could
be increased (e.g. to 75.0). However, as the amount of increase is not based on
a solid computational basis but is more or less arbitrary, it may happen that even
serious statistical weaknesses will not be detected. In particular, a strong (data-compressing
and hence throughput-reducing) mathematical follow-up treatment
is then absolutely necessary.

The $\chi^2_{15}$-approximation may seem to be terribly bad. However, this is not the
case: Note that $Prob(T > 30.6) = 0.01025$ and $Prob(X > 30.6) = 0.00993$, for
example, with a relative error of about 0.03. The relative error is maximal at
the tail of the $\chi^2_{15}$-distribution, i.e. for large rejection boundaries z. Increasing
the sample size (here: 320 bits) reduces the relative error for each fixed value z
(here: for z = 65.0) since the exact distributions converge to the $\chi^2_{15}$-distribution
as the sample sizes tend to infinity.

However, these considerations point to a serious general problem. In particular,
if the sample size of a statistical test is small, one has to be very
careful when using an approximate distribution of the respective test variable at
its tail. To obtain the exact rejection probability for ideal sequences in Example
2 we just had to count the tuples in $(\{0,1\}^4)^{80}$ for which the $\chi^2$-test value is
$\le 65.0$. However, although symmetries were exploited the computational effort
was considerable. For non-ideal sequences it was not practically feasible to decide
whether demand (R2) or (R3), resp., is fulfilled. Nevertheless, online tests with such
unpleasant properties are widely used in practice. In many cases the exact
distribution of the test variable cannot be determined even for ideal sequences.

5 A New Online Test Procedure

In this section we describe a new online test procedure. We will show later that it
meets all requirements formulated in Sect. 3. In Sect. 8 it will be illustrated by
two examples.

Step 1: First, the statistician has to choose a statistical test, the so-called “basis
test”, and to fix its sample size n. This may be a $\chi^2$-test or any test from [7], for
example, provided that the needed memory, the lines of code and the execution
time are acceptable for the device used and the intended applications. Ideally,
the basis test should be chosen with respect to the mathematical model of the
TRNG, or more precisely, of the random variables $B_1, B_2, \ldots$. (Of course, this
mathematical model should have been confirmed by extensive statistical investigations
of some TRNG prototypes. As the choice of the basis test does not affect
the general principle of our online test procedure we will not pursue this topic
in the remainder.) In the following $E_0(T)$ denotes the mean of the test variable
T under the null hypothesis, that is, if the random variables $B_1, B_2, \ldots$ were iid
and equidistributed on $\Omega_{das}$.

Step 2: With respect to the intended applications minimal requirements on the
distribution of the random variables $B_1, B_2, \ldots$ have to be specified. (Example:
Based on the mathematical model of the noise source and the evaluation
of TRNG prototypes the test designer concludes that the binary random variables
$B_1, B_2, \ldots$ are Markovian. For the intended applications $Prob(B_j = 1) \in [0.48, 0.52]$
and $|Prob(B_{j+1} = 1 \mid B_j = 0) + Prob(B_{j+1} = 0 \mid B_j = 1) - 1| < 0.01$
are viewed as sufficient. In particular, it may be reasonable to choose a basis
test which considers the one-step transition frequencies.) Time intervals or
events have to be specified after which a basis test has to be executed, e.g.: always,
one basis test per second, one basis test after an external call for random
numbers, basis tests within the idle time of the device (if the TRNG is part of
a larger cryptographic system) etc. With regard to the intended applications a
reasonable upper bound for the average number of noise alarms within a time
interval (year, month etc.) has to be specified. This upper bound should not be
exceeded for any distribution of the random variables $B_1, B_2, \ldots$ which meets
the minimal requirements specified before. Moreover, the consequences of a noise
alarm have to be laid down, e.g.: The TRNG is put out of service, no further
random numbers are produced until a check of the noise source and / or a manual
restart of the TRNG, or something similar.

Step 3: A test suite consists of at most N basis tests. The basis test values
are denoted with $T_1, T_2, \ldots$ while $H_0 := E_0(T)$. In step $j \ge 1$ a basis test is
performed, and the basis test value $T_j$ is determined. Then
$H_j := (1 - \beta) H_{j-1} + \beta T_j$ is computed ($\beta \ll 1$) and rounded to a multiple
of $2^{-c}$ where c is a fixed integer. Moreover, the following decision rules have to
be considered:

(A): if $T_j \notin [r, s]$  noise alarm    (3)

(B): if $T_{j-k+1}, \ldots, T_j \notin [t, u]$  stop the test suite    (4)

(C): if $H_j \notin [v, w]$  stop the test suite    (5)

The parameters r and s should be chosen such that a violation of decision
criterion (A) is extremely unlikely unless the random noise source has totally broken
down. Consequently, a violation of decision rule (A) causes a noise alarm. Alternatively,
if x consecutive test suites have been stopped due to (B) or (C) this
also causes a noise alarm. Otherwise, after a test suite has been finished (due to
a stop or because N basis tests have been executed) the next test suite begins.

The choice of the parameters $n, N, \beta, c, r, s, k, t, u, v, w$ and x should consider
the goals formulated in Step 2. Without loss of generality we may assume that the
parameters v and w are multiples of $2^{-c}$. We point out that storing and updating
the “history value” $H_j$ needs no more than integer arithmetic. For this, we set
$\beta := 2^{-b}$ for a suitable integer b. Before the test suite begins $H_0 := E_0(T)$ is
rounded to a multiple of $2^{-c}$. To update the history value $H_{j-1}$ the actual basis
test value $T_j$ is rounded to a multiple of $2^{-c}$ and then

$H_j := ((2^b - 1) H_{j-1} + T_j + 2^{b-1}) \gg b$    (6)

is calculated. (As usual, “$\gg b$” denotes a right shift by b bits.) We point out
that if $H_j$ is calculated with floating point arithmetic the basis test value $T_j$
need not be rounded beforehand (cf. Remark 2 in Sect. 7).
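
The history update (6) and the decision rules (A), (B) and (C) can be sketched in pure integer arithmetic as follows. All values are held as integers in units of $2^{-c}$; the function names and the concrete parameter values are hypothetical and serve only to exercise the three rules.

```python
def update_history(h_prev, t_j, b):
    """Equation (6): integer update of the history value with beta = 2**-b.

    h_prev and t_j are integers in units of 2**-c (already scaled by 2**c).
    """
    return ((2**b - 1) * h_prev + t_j + 2**(b - 1)) >> b

def run_test_suite(test_values, h0, b, N, r, s, k, t, u, v, w):
    """Apply decision rules (A), (B), (C) to at most N basis test values.

    Returns "alarm", "stop" or "finished".
    """
    h = h0
    outside = 0                        # current run of values outside [t, u]
    for t_j in test_values[:N]:
        if not (r <= t_j <= s):
            return "alarm"             # rule (A)
        outside = outside + 1 if not (t <= t_j <= u) else 0
        if outside >= k:
            return "stop"              # rule (B)
        h = update_history(h, t_j, b)
        if not (v <= h <= w):
            return "stop"              # rule (C)
    return "finished"

# Hypothetical parameters, scaled by 2**c with c = 4 (so 240 stands for 15.0).
params = dict(h0=240, b=6, N=10, r=0, s=1040, k=3, t=32, u=416, v=208, w=272)
print(run_test_suite([240] * 10, **params))        # finished
print(run_test_suite([240, 2000], **params))       # alarm
print(run_test_suite([500, 500, 500], **params))   # stop
```

Only the current history value and a small run counter have to be stored between basis tests, which is what keeps the memory footprint of the procedure small.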
6 Rationale and Advantages
of the New Online Test Procedure

Roughly speaking, statistical results get more reliable the more tests are performed.
The value $H_j$ “contains” the history of the actual test suite up to step j
without storing the test values $T_1, T_2, \ldots, T_j$ explicitly. Decision rules (A) and
(B) shall detect a total breakdown of the noise source or a rapid deterioration of
the statistical quality of the das-random numbers, resp. The main task of
decision rule (C), however, is to detect weaknesses in the long-term behaviour.

If the basis test values can be calculated with integer arithmetic (which should
exist anyway) then the whole online test procedure needs no more than integer
arithmetic (cf. Sect. 5). As the evaluation of the decision rules (A), (B) and (C)
requires only little running time, a few extra lines of code and little extra memory
our online test procedure perfectly meets demand (R4). In the worst case decision
rule (A) requires only a few das-random numbers more than the sample size n of
one basis test to detect a total breakdown of the noise source. Hence requirement
(R1) is also fulfilled. In Example 2 we described an online test which is widely
used in practice. Even for ideal random number generators it required enormous
computational power to determine the expected number of noise alarms within
a particular time interval. For (non-ideal) TRNGs the system designer
had almost no control over what is going on. For the online test procedure described
in Sect. 5, however, for each parameter set and for each assumed distribution of
the $B_1, B_2, \ldots$ we can at least approximately determine the expected number
of noise alarms within a time interval. That is, we also have control over the
effects of the test procedure if it is applied to non-ideal das-random number
sequences. A suitably chosen parameter set $n, N, b$ ($\beta := 2^{-b}$), $c, r, s, k, t, u, v, w$
and x supports the goals formulated in Step 2 of Sect. 5. In particular, it meets the
requirements (R2) and (R3).



7 Mathematical Background 

In this section we determine the average number of test suites until a noise alarm 
occurs. As each basis test requires a large number of das-random numbers we 
may assume that the test variables 

$T_1, T_2, \ldots$ are iid, (7)

regardless of the distribution of the random variables $R_1, R_2, \ldots$ (which may be
dependent!) and the test strategy, i.e. whether all das-random numbers are tested
or not (see Sect. 5, Step 2). The distribution of the test variables $T_1, T_2, \ldots$, of
course, depends essentially on that of $B_1, B_2, \ldots, B_n$.

The only task of decision rule (A) is to detect a possible total breakdown
of the noise source. The probability that (A) causes a noise alarm is absolutely
negligible unless the random noise source has indeed totally broken down. We
hence restrict our attention to decision rules (B) and (C). First, we derive
a formula to calculate the probability $p_{st}$ for a stop of a test suite. Recall that
the history values $H_1, H_2, \ldots$ and the parameters v and w are multiples of $2^{-c}$.
Let $Y_j$ denote the largest integer for which $T_{j-Y_j+1}, \ldots, T_j \notin [t,u]$.
(In particular, $Y_j := 0$ if $T_j \in [t,u]$.) We interpret these numbers as realizations
of random variables $Y_1, Y_2, \ldots$. Due to (7) the random vectors $(H_0, Y_0 := 0),
(H_1, Y_1), \ldots$ form a homogeneous Markov chain on the infinite state space
$\{\ldots, -2^{-c}, 0, 2^{-c}, 2 \cdot 2^{-c}, \ldots\} \times \{0, 1, \ldots\}$. From this we derive a homogeneous
Markov chain $Z_0, Z_1, \ldots$ on the finite state space

$\Omega = \{(h,y) \mid h \in [v,w],\ h \text{ is a multiple of } 2^{-c},\ 0 \le y < k\} \cup \{\infty\}$. (8)

In particular, $Z_j$ attains the state $\infty$ if $(1-\beta)H_{m-1} + \beta T_m \notin [v,w]$ or $Y_m = k$
for any $m \le j$, whereas $Z_j := (H_j, Y_j)$ otherwise. That is, $Z_j$ attains the state $\infty$ if the
test suite has been stopped by time step j. In particular, $\infty$ is an absorbing state.
(For the mathematical background of finite Markov chains the interested reader
is referred to [8].)

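Next, we determine the transition matrix $Q = (q_{\omega_1,\omega_2})_{\omega_1,\omega_2 \in \Omega}$. We point out
that $(H_{j-1}, H_j) = (h, h')$ for $h, h' \in [v, w]$ iff $(1-\beta)h + \beta T_j \in [h' - 2^{-c-1}, h' + 2^{-c-1})$,
or equivalently, iff $T_j \in \beta^{-1}[h' - 2^{-c-1} - (1-\beta)h,\ h' + 2^{-c-1} - (1-\beta)h)$.
Elementary but careful considerations yield the transition matrix
$Q = (q_{\omega_1,\omega_2})_{\omega_1,\omega_2 \in \Omega}$ with transition probabilities

$$q_{\omega_1,\omega_2} = \begin{cases}
\mathrm{Prob}(T_j \in A_{h,h'} \cap (\mathbb{R} \setminus C)) & \text{if } \omega_1 = (h,y),\ \omega_2 = (h',y+1) \text{ and } y < k-1 \\
\mathrm{Prob}(T_j \in A_{h,h'} \cap C) & \text{if } \omega_1 = (h,y),\ \omega_2 = (h',0) \\
\mathrm{Prob}(T_j \in \mathbb{R} \setminus D_h) & \text{if } \omega_1 = (h,y),\ \omega_2 = \infty \text{ and } y < k-1 \\
\mathrm{Prob}(T_j \in (\mathbb{R} \setminus D_h) \cup (\mathbb{R} \setminus C)) & \text{if } \omega_1 = (h,k-1),\ \omega_2 = \infty \\
1 & \text{if } \omega_1 = \omega_2 = \infty \\
0 & \text{else}
\end{cases} \qquad (9)$$

where we used the abbreviations

$A_{h,h'} := \beta^{-1}[h' - 2^{-c-1} - (1-\beta)h,\ h' + 2^{-c-1} - (1-\beta)h)$,
$C := [t - 2^{-c-1},\ u + 2^{-c-1})$ and
$D_h := \beta^{-1}[v - 2^{-c-1} - (1-\beta)h,\ w + 2^{-c-1} - (1-\beta)h)$.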



Remark 2. If the basis test value $T_j$ is rounded to a multiple of $2^{-c}$ before it is
“mixed” with $H_{j-1}$ (cf. end of Sect. 5), the terms “Prob($T_j \in \ldots$)” in (9) should
read “Prob(round($T_j$) $\in \ldots$)”, where round(·) temporarily denotes the round-off
function.

Now let $e_\omega$ denote the column vector with $|\Omega|$ components which are all zero
except the component indexed by $\omega$, which equals 1. As $Z_0 = (H_0, Y_0) = (E_0(T), 0)$
we obtain

$p_{st} := \mathrm{Prob}(\text{test suite is stopped}) = \mathrm{Prob}(Z_N = \infty) = e_{(E_0(T),0)}^{t}\, Q^N e_\infty$. (10)
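
To illustrate how (10) is evaluated, the following minimal sketch propagates the start distribution through N steps of an absorbing Markov chain instead of forming $Q^N$ explicitly. The 3-state matrix is a purely hypothetical toy chain (state 2 plays the role of $\infty$); it is not derived from any basis test.

```python
def stop_probability(Q, start, absorbing, N):
    """Return Prob(Z_N = absorbing) for a chain with transition matrix Q started in `start`."""
    size = len(Q)
    dist = [0.0] * size
    dist[start] = 1.0                      # row vector e_start^t
    for _ in range(N):                     # dist <- dist * Q, repeated N times
        dist = [sum(dist[i] * Q[i][j] for i in range(size)) for j in range(size)]
    return dist[absorbing]                 # component selected by e_absorbing

# Hypothetical toy chain: states 0 and 1 are transient, state 2 is absorbing.
Q_toy = [
    [0.90, 0.09, 0.01],
    [0.50, 0.40, 0.10],
    [0.00, 0.00, 1.00],
]
print(stop_probability(Q_toy, start=0, absorbing=2, N=2))
```

For the real state space the vector-matrix form costs N products of size $|\Omega|$, whereas repeated squaring of Q needs only about $\log_2 N$ matrix products; which is cheaper depends on $|\Omega|$ and N.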
The probability that x particular test suites are stopped is
$1 - \sum_{r=1}^{x}(1 - p_{st})\,p_{st}^{r-1} = 1 - (1 - p_{st}^{x}) = p_{st}^{x}$. Wald’s equation ([5], 50) hence
implies

$E(\#\text{test suites per noise alarm}) = \dfrac{1 - p_{st}^{x}}{(1 - p_{st})\,p_{st}^{x}}$. (11)
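
As a plausibility check, the expected waiting time for x consecutive stopped test suites can be turned into noise alarms per year. The sketch below uses the closed form $(1 - p_{st}^x)/((1 - p_{st})p_{st}^x)$ together with figures quoted in Sect. 8 (N = 512 basis tests per suite, about 530000 basis tests per year); the function names are ours.

```python
def expected_suites_per_alarm(p_st, x):
    """Expected number of test suites until x consecutive suites have been stopped."""
    return (1.0 - p_st ** x) / ((1.0 - p_st) * p_st ** x)

def alarms_per_year(p_st, x, N, basis_tests_per_year):
    """Expected noise alarms per year when each test suite consists of N basis tests."""
    suites_per_year = basis_tests_per_year / N
    return suites_per_year / expected_suites_per_alarm(p_st, x)

# Values of p_st taken from Tables 1 and 2 (x = 3 and x = 4, respectively).
print(alarms_per_year(0.0162, 3, 512, 530000))
print(alarms_per_year(0.2790, 3, 512, 530000))
print(alarms_per_year(0.3866, 4, 512, 530000))
```

The results agree with the right-hand columns of Tables 1 and 2 to the printed precision.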

In the following we denote the distribution of $B_1, B_2, \ldots, B_n$ by $\nu[n]$ while
$F_{\nu[n]}(\cdot)$ denotes the cumulative distribution function of the test variable $T_1$ if the
basis test is applied to $B_1, \ldots, B_n$. In particular, $\mu[n]$ means that the $B_j$ are iid
and equidistributed on $\Omega_{das}$ (null hypothesis).

To initialize the transition matrix Q we have to know the cumulative
distribution function of $T_j$. What is the difference to the situation in Example 2?
There, too, the knowledge of $F_{\nu[n]}(65.0)$ would have solved the main problems.
In particular, one could check whether (R2) and (R3) are fulfilled. As already
mentioned in Sect. 4, even for $\mu[n]$ the relative error between the exact cumulative
distribution function $F_{\mu[n]}(\cdot)$ of $T_1$ and that of the $\chi^2_{15}$-distribution is very large at
the (extreme) tails of both distributions. In the online test procedure described
in Sect. 5, however, the factor $\beta$ is small and thus even an extremely large single
basis test value $T_j$ will not influence the history variable $H_j$ considerably unless
a total breakdown of the noise source has just occurred. For decision rule (C)
the totality of all basis test values up to this moment is essential while the
probability that decision rule (B) stops the current test suite depends essentially on
1 — (Fn[n]{u) — Fn[n]{t)) and, of course, on k. We recommend to choose t and u 
that for the tolerable distributions iy[n] this probability is > 10“^. In particular, 
unlike in Example 2, for $\nu[n] = \mu[n]$ the deviation of the approximate cumulative
distribution function of $T_1$ from $F_{\mu[n]}$ will not influence $p_{st}$ considerably.

For general $\nu[n]$ we may approximate $F_{\nu[n]}$ by an empirical cumulative
distribution function $F_{\nu[n]emp}$ which we derive with a stochastic simulation (see, e.g.,
[9], [5]). For this, we need a fast pseudorandom number generator with good
statistical properties. (Unpredictability of the pseudorandom numbers is irrelevant
in this context.) A sound candidate is, for example, the recursive algorithm

Xn+i = axn + ^ (mod 2®"^), with a=l (mod 4), a > 2^^®, (12) 

a so-called linear congruential generator. Setting Vj := Xj2~^"^ yields a sequence 
of standard random numbers. The standard random numbers behave similarly
to realizations of iid random variables $V_1, V_2, \ldots$ which are equidistributed on the
unit interval [0, 1). From the standard random numbers one derives a sequence
$B'_1, B'_2, \ldots$ which is viewed as a realization of $B_1, B_2, \ldots$. (Example: Let the $B_j$
be iid binary random variables with Prob($B_j = 1$) = 0.48. Then we set $B'_j := 1$ if
$V_j < 0.48$ and $B'_j := 0$ else.) We apply the basis test to $B'_1, B'_2, \ldots, B'_n$ and compute

the respective basis test value T{. Repeating this process K times {K > 10®) we 
obtain an empirical cumulative distribution function $F_{\nu[n]emp}$ which we use for
the initialization of the transition matrix Q. Due to Glivenko-Cantelli’s theorem
([6], 145) the absolute value $\sup_{x \in \mathbb{R}} |F_{\nu[n]}(x) - F_{\nu[n]emp}(x)|$ should be small,
which is essential for decision criterion (C). Concerning decision rule (B), the
relative error |(1 T -^z/[n](^)) (1 -f^iy[n]emp(^))|/(l 

K[n]emp{u) + F^,[n]emp{t)) should be Small as 1 - F^[„](m) - -F)xH(i) > 10“^ for 
typical choices of t and u. To obtain a reliable approximation of 1 — (65.0) for 
the x^-square test in Example 2, however, the parameter K had to be gigantic. 
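
The simulation described above can be sketched as follows. Everything concrete here is an assumption made for illustration: the generator constants are Knuth's MMIX parameters with modulus $2^{64}$ (the multiplier is congruent to 1 mod 4), the basis test is the chi-square test on 128 four-bit blocks used in Sect. 8, and K is kept small so that the sketch runs quickly.

```python
import bisect

MASK64 = (1 << 64) - 1
A = 6364136223846793005   # assumed multiplier (MMIX), A % 4 == 1
C = 1442695040888963407   # assumed odd increment (MMIX)

def standard_random_numbers(seed):
    """Linear congruential generator; yields numbers in [0, 1) from the top 53 bits."""
    x = seed & MASK64
    while True:
        x = (A * x + C) & MASK64
        yield (x >> 11) / float(1 << 53)

def chi2_basis_test(bits):
    """Chi-square statistic on 128 four-bit blocks (n = 512 bits, expected count 8 per class)."""
    counts = [0] * 16
    for i in range(0, 512, 4):
        counts[bits[i] * 8 + bits[i + 1] * 4 + bits[i + 2] * 2 + bits[i + 3]] += 1
    return sum((f - 8) ** 2 for f in counts) / 8.0

def empirical_cdf(p_one, K, seed=1):
    """Simulate K basis test values for iid bits with Prob(B_j = 1) = p_one."""
    gen = standard_random_numbers(seed)
    values = sorted(
        chi2_basis_test([1 if next(gen) < p_one else 0 for _ in range(512)])
        for _ in range(K)
    )
    return lambda x: bisect.bisect_right(values, x) / K

F_emp = empirical_cdf(0.48, K=200)
print(F_emp(26.75))   # empirical estimate of Prob(T <= 26.75)
```

Only the high-order bits of each state are used because the low-order bits of a power-of-two-modulus generator have short periods. A serious estimate needs a far larger K than in this sketch.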

Remark 3. (i) Of course, $F_{\nu[n]emp}$ is not exact. In principle, this could cause
a bad approximation of the exact probability $p_{st}$. We gave reasons why this
should not be the case. Moreover, stochastic simulations support this opinion. If
the empirical distribution was derived twice {K = 10®) for the same distribution 
$\nu[n]$ the respective $p_{st}$-values usually differed by less than 1 per cent from their
arithmetic mean. This shows that the derived results are stable. In (ii) we give
a formal argument that decision rule (B) amplifies small relative errors by no
more than a factor of k.

(ii) The probability that at least k consecutive test values Ti,T 2 , . . . ,Tat lie 
outside [t,u] is about 1 — (1 — where p temporarily stands for the 

probability Prob(Ti ^ [t,u]) (or Proh{round{Ti) ^ [t, u]), resp.; cf. Remark 2) If 
p' denotes an approximation of p the relative error equals |(1 — (1 — p^'j^P~P')'j — 
(1 — (1 — ^)|/(1 — (1 — )). If (TVp^)^, {Np'^Y' <C 1 the relative 

error is about |(1 — p)p^ ~ (1 ~ ~ p')p'^ ■ If additionally p ~ p' (which 

is likely, for example, if p > 10“® and K > 10®) this term further simplifies to 
k\p-p'\/p'. 

8 Examples 

In this section we discuss two examples. In particular, Example 3 provides an
appropriate solution for Example 2. The effect of the particular parameters is explained
in Sect. 9.

Example 3. We consider the same situation as in Example 2 (cf. Sect. 3). Due
to the construction of the noise source it may be assumed that the random
variables $B_1, B_2, \ldots$ are iid but not necessarily equidistributed (cf. Remark 4(iii)).
Extensive statistical investigations of TRNG prototypes have confirmed this
hypothesis. For the intended applications it is absolutely acceptable (“tolerable”)
if Prob($B_j = 1$) $\in$ [0.49, 0.51]. Otherwise, a noise alarm should occur sooner or
later, depending on the “degree” of the statistical weaknesses of the das-random
numbers. If Prob($B_j = 1$) < 0.475 or Prob($B_j = 1$) > 0.525, however, a noise
alarm should occur soon. Recall that per TRNG about 530000 basis tests are
performed per year (cf. Example 2).

Proposed solution: Basis test: $\chi^2$-test on 128 four-bit blocks (i.e. n = 512).
Further, we use the parameter set N = 512, $\beta$ = 1/64 (i.e. b = 6), c = 5, r = 0.0,
s = 200.0, k = 3, t = 0.0, u = 26.75, v = 13.0, w = 17.0, x = 3.

The values in Table 1 were derived on basis of empirical distribution functions 
as described in Sect. 7 {K = 10®). The right-hand column of Table 1 gives the 
expected number of noise alarms per year. In particular, if Prob($B_j = 1$) $\notin$
[0.475, 0.525] a noise alarm will occur after a few test suites.

Table 1. Example 3: Expected number of noise alarms per year

Prob(B_j = 1)    p_st      E(# noise alarms / year)
0.500            0.0162    0.004
0.495            0.0184    0.006
0.490            0.0289    0.024
0.485            0.0745    0.396
0.480            0.2790    16.6
0.475            0.7470


Table 2. Example 4: Expected number of noise alarms per year

Prob(B_j = 1)    p_st      E(# noise alarms / year)
0.500            0.0151    0.00005
0.495            0.0180    0.00011
0.490            0.0349    0.0015
0.485            0.1096    0.1332
0.480            0.3866    14.5
0.475            0.8501


Remark 4. (i) The calculation of the basis test values requires no more than
integer multiplication and addition and, finally, a division by $8 = 2^3$.

(ii) For c = 6 instead of c = 5 the $p_{st}$-values are some percent larger as the
Markov process $Z_0, Z_1, \ldots$ is less “inert” (cf. Sect. 9).

(iii) For simplicity, in Examples 3 and 4 we assume that the random variables
$B_1, B_2, \ldots$ are iid. This, however, need not be the case for all types of random
noise sources. We point out that dependent random variables $B_j$ can be handled
in the same way as iid ones. (Numerical example: Let the random variables
$B_1, B_2, \ldots$ be Markovian with Prob($B_{j+1} = 1 \mid B_j = 0$) = 0.490 and
Prob($B_{j+1} = 0 \mid B_j = 1$) = 0.490. Using the same parameters as in Example 3 yields
$p_{st} \approx 0.0243$, and hence about 0.014 noise alarms are expected per year.) Of course,
if we drop the assumption that the $B_j$ are iid we usually have to consider many
more distributions than listed in Table 1.



Example 4. We consider the same situation as in Example 3. However, due to
the intended application the expected number of noise alarms per year must
not be larger than 0.0015 if the TRNG produces appropriate das random numbers.
(For example, the noise source could be part of a smart card which is used by
customers to execute e-commerce applications. If the TRNG causes a noise alarm
the smart card denies further service and has to be replaced by a new one.)

Proposed solution: Basis test: $\chi^2$-test on 128 four-bit blocks (i.e. n = 512).
Further, we use the parameter set N = 512, $\beta$ = 1/64 (i.e. b = 6), c = 5, r = 0.0,
s = 200.0, k = 4, t = 0.0, u = 24.0, v = 13.125, w = 16.875, x = 4.

Remark 5. Our online test procedure enables “controlled” testing. If “sound”
TRNGs generate das random numbers which themselves are appropriate for the
intended applications and if the mathematical model of the physical noise source
and thus that of the das random numbers is reliable (i.e., if we can be sure
that the (possibly TRNG-specific) distribution of the das random numbers
is contained in the assumed class of distributions) this usually makes a strong,
data-compressing (throughput-reducing!) mathematical follow-up treatment
dispensable. If the reduced data rate is still sufficiently large for the intended
applications, however, a data-compressing follow-up treatment may be used as an
additional security mechanism, even if this may not actually be necessary.



9 Fine-Tuning of the Parameter Set 



If the basis test and the parameter set have been chosen suitably the online test 
procedure should perfectly meet the particular requirements of the intended ap- 
plications. In Sect. 7 we described how to compute Pst and the expected number 
of test suites until a noise alarm occurs. However, as the computation of Pst is 
time-consuming we cannot try thousands of randomly chosen parameter sets. 
Below, we briefly describe the effect of particular parameters. 

n: Unless it is too small the sample size n usually does not influence the
distribution of the basis test variables $T_j$ if $\nu[n] = \mu[n]$. If $\nu[n] \ne \mu[n]$, however,
increasing n often implies higher rejection rates. Example: Let
$T := 2(B_1 + \cdots + B_n - 0.5n)/\sqrt{n}$, which merely considers the number of ones within the
sample. If the $B_j$ are iid with Prob($B_j = 1$) = p the central limit theorem
implies

$\mathrm{Prob}(|T| > a) = 1 - \Phi\left(\frac{a}{\sqrt{4p(1-p)}} - \frac{\sqrt{n}(p - 0.5)}{\sqrt{p(1-p)}}\right) + \Phi\left(\frac{-a}{\sqrt{4p(1-p)}} - \frac{\sqrt{n}(p - 0.5)}{\sqrt{p(1-p)}}\right)$

where $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal
distribution. If a = 2.575 and p = 0.48, for example, for n = 128 (resp., for n = 512)
we obtain the probability 0.018 (resp., 0.048). If p = 0.5, however, this probability
equals 0.01 for both n = 128 and n = 512.
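
The numbers in this example can be reproduced with the normal approximation alone. The following sketch (function names are ours) evaluates the two-sided rejection probability using the error function from the standard library.

```python
import math

def phi(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rejection_probability(a, p, n):
    """Normal approximation of Prob(|T| > a) for T = 2(B_1 + ... + B_n - n/2)/sqrt(n)."""
    scale = math.sqrt(4.0 * p * (1.0 - p))
    shift = math.sqrt(n) * (p - 0.5) / math.sqrt(p * (1.0 - p))
    return 1.0 - phi(a / scale - shift) + phi(-a / scale - shift)

for n in (128, 512):
    print(n, rejection_probability(2.575, 0.48, n), rejection_probability(2.575, 0.5, n))
```

For p = 0.5 the shift term vanishes, which is why the rejection rate under the null hypothesis does not depend on n.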

N: Due to (10) it is reasonable to choose a power of 2 as this minimizes the
number of matrix multiplications and thus the computation time. Further, it avoids
unnecessary round-off errors when computing the probability $p_{st}$.

x: For small $p_{st}$ equation (11) is essentially determined by the term $p_{st}^{-x}$. Thus, for
different $p_{st}$-values increasing the parameter x amplifies the ratio of the expected
number of test suites until the first noise alarm occurs. (Example: If $p_{st;1} = 2p_{st;2}$
then $p_{st;1}^{-3} = p_{st;2}^{-3}/8$ but $p_{st;1}^{-4} = p_{st;2}^{-4}/16$.) This effect can be used to “separate”
the tolerable from the non-tolerable distributions.

$\beta$: The smaller $\beta := 2^{-b}$ the smaller is the influence of single basis test values
on the history values $H_1, H_2, \ldots$.

c: The history variables $H_0, H_1, \ldots$ may be interpreted as a “weighted” random
walk on $[v, w] \cap \{\ldots, -2^{-c}, 0, 2^{-c}, 2 \cdot 2^{-c}, \ldots\}$ with absorbing state $\infty$. The smaller
c the more “inert” is this random walk and hence the smaller is $p_{st}$. In particular,
$H_j \ne H_{j-1}$ iff $(1-\beta)H_{j-1} + \beta T_j \notin H_{j-1} + [-2^{-c-1}, 2^{-c-1})$ iff
$T_j \notin H_{j-1} + [-2^{b-c-1}, 2^{b-c-1})$. We recommend to choose $b, c \in \{5, 6\}$. (Recall, however,
that the transition matrix Q has $|\Omega|^2 = [k(2^c(w - v) + 1) + 1]^2$ entries.)
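
To get a feeling for these quantities, the sketch below evaluates the size of the state space and the threshold $2^{b-c-1}$ beyond which a basis test value changes the history value, for the parameter sets of Examples 3 and 4. The helper names are ours, and the state count assumes that v and w are multiples of $2^{-c}$.

```python
def state_space_size(k, c, v, w):
    """|Omega| = k * (2^c * (w - v) + 1) + 1, assuming v and w are multiples of 2^-c."""
    levels = round((w - v) * 2 ** c) + 1      # admissible history values in [v, w]
    return k * levels + 1                     # k counter values per level, plus the absorbing state

def change_threshold(b, c):
    """H_j differs from H_{j-1} only if T_j leaves H_{j-1} + [-2^(b-c-1), 2^(b-c-1))."""
    return 2.0 ** (b - c - 1)

size3 = state_space_size(k=3, c=5, v=13.0, w=17.0)       # Example 3
size4 = state_space_size(k=4, c=5, v=13.125, w=16.875)   # Example 4
print(size3, size3 ** 2, size4, size4 ** 2)
print(change_threshold(6, 5), change_threshold(6, 6))
```

Raising c from 5 to 6 halves the threshold, so the walk becomes less inert, while the number of matrix entries roughly quadruples.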

10 Conclusions and Final Remarks 

In this paper we proposed a new online test, or more precisely, a new online test 
procedure for which it is practically feasible to determine the expected number of 
noise alarms within a time interval, even if the tested random numbers are not in- 
dependent and equidistributed. The system designer can vary a whole parameter 
set and hence can fit the test to the very special requirements of the intended ap- 
plications. In particular, this makes data-compressing (i.e. throughput-reducing) 
mathematical follow-up treatments in many cases dispensable. Compared with
“ordinary” online tests the proposed online test procedure needs only a little
more memory, some additional lines of code and slightly more running time.

Abstract. In this paper we use the Hessian form of an elliptic curve
and show that it offers some performance advantages over the standard
representation. In particular when a processor allows the evaluation of a
number of field multiplications in parallel (either via separate ALUs, a
SIMD type operation or a pipelined multiplication unit) one can obtain
a performance advantage of around forty percent.



1 Introduction 

Much research has been conducted on implementing cryptographic operations
on various types of computer architectures, for example standard superscalar
RISC and CISC processors [3], [6], smart cards [9] and FPGAs [8]. In this paper
we are interested in producing highly efficient implementations of operations
needed in elliptic curve cryptography by exploiting parallelism.

In [3] the instruction level parallelism (ILP) in the various AES candidates 
was examined. This is an important area of research for any cryptographic algo- 
rithm as most algorithms will be implemented on superscalar processors which 
make use of ILP to increase performance. In most cryptographic algorithms the 
basic blocks are dependent, hence the only natural level of parallelism is at the 
instruction level. We shall show in this paper that this is not true for elliptic 
curve cryptosystems. We shall show that for elliptic curve systems one can make 
use of parallelism at the basic block, or function, level. We shall then go on to 
show that a special class of elliptic curves, namely those with a point of order 
three, have addition laws which are highly parallel. In addition the curves will 
be able to be used in both even and large characteristic. We shall give a com- 
parison of a standard elliptic curve representation against the Hessian form on 
an FPGA. 

We see significant advantages for using the ideas in this paper when we
consider the use of custom processors for cryptographic operations. These are
common at both the high and low end of the markets: smart cards require
specialized cryptographic coprocessors to be able to perform a single cryptographic
operation in a reasonable amount of time, whilst high end servers often require
cryptographic accelerator boards so as to increase the throughput of the overall
traffic.

We envisage a processor which has the ability to conduct a number of
multi-precision arithmetic operations in parallel. We assume that the field is either
$\mathbb{F}_p$ or $\mathbb{F}_{2^n}$, as is common in elliptic curve systems. This can be implemented
in a number of ways, either as a SIMD operation, via a number of standard
coprocessors operating in parallel, by having a number of Arithmetic Logic Units
(ALUs) operating in parallel or by having a single pipelined multiplication unit.
To aid the discussion later we shall assume up to three field operations can 
be applied in parallel in a SIMD fashion, i.e. at most three multiplications or 
three additions. It will turn out that performing additions in parallel is not as 
important, and in any case can be done virtually for free in characteristic two. 



2 Cryptographic Algorithms 



It is clear that standard implementations of RSA and discrete logarithm type 
systems cannot make use of the architecture proposed in the previous section. Of 
course they could perform a number of RSA exponentiations in parallel, but we 
would not obtain a performance advantage for a single modular exponentiation. 
The reason for this lack of improvement is that each modular multiplication 
required in a group exponentiation algorithm is dependent on the previous op- 
erations. 

For elliptic curve cryptography (ECC) systems the situation is very different. 
In ECC one trades smaller sizes for the field elements against a more compli- 
cated group operation. We have already remarked that the group exponentiation 
techniques do not offer any opportunity for parallelism. However, the group op- 
eration itself does offer such opportunities. 

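To see this consider the following description of the doubling formulae for a
point P = (X, Y) on a curve over a field of characteristic p > 3 given by

$Y^2 = X^3 - 3X + b$.

We assume the point is given in Jacobian projective coordinates P = (x, y, z)
and the output is given by (x', y', z'):

$\lambda_1 = x^2$                    $\lambda_2 = z^2$               $\lambda_3 = y^2$
$\lambda_4 = yz$                     $\lambda_5 = x\lambda_3$        $\lambda_6 = \lambda_3^2$
$\lambda_7 = \lambda_2^2$
$\lambda_9 = \lambda_1 - \lambda_7$  $z' = 2\lambda_4$               $\lambda_8 = 4\lambda_5$
$\lambda_{10} = 8\lambda_6$          $\lambda_{11} = 3\lambda_9$     $\lambda_{12} = 2\lambda_8$
$\lambda_{13} = \lambda_{11}^2$
$x' = \lambda_{13} - \lambda_{12}$
$\lambda_{14} = \lambda_8 - x'$
$\lambda_{15} = \lambda_{11}\lambda_{14}$
$y' = \lambda_{15} - \lambda_{10}$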

Each row corresponds to the (at most) three operations which can be carried out
in parallel. As one can see one can obtain a limited improvement in performance
by exploiting the parallel nature of the computation. However, this is rather
restricted due to our SIMD based constraint of not allowing the mixing of field
additions and multiplications. Even if we dropped this constraint the above
algorithm would not improve performance that much. If we assume each row in
the above table can be executed in the same time on a SIMD/pipelined style
processor as a single operation could on a standard processor, we would obtain
roughly a 50 percent performance improvement.
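
The row structure can be mirrored in software to check a schedule of this kind. The sketch below uses the textbook Jacobian doubling for a curve with coefficient -3 over a prime field (one row of independent operations per line) and compares it with affine doubling; the prime and the test point are arbitrary choices of ours, and `pow(v, -1, p)` requires Python 3.8 or later.

```python
def jacobian_double(x, y, z, p):
    """Double (x, y, z) in Jacobian coordinates on Y^2 = X^3 - 3X + b over F_p."""
    l1, l2, l3 = x * x % p, z * z % p, y * y % p              # three squarings
    l4, l5, l6 = y * z % p, x * l3 % p, l3 * l3 % p           # three multiplications
    l7 = l2 * l2 % p                                          # z^4
    l9, z_out, l8 = (l1 - l7) % p, 2 * l4 % p, 4 * l5 % p     # additions only
    l10, l11, l12 = 8 * l6 % p, 3 * l9 % p, 2 * l8 % p        # additions only
    l13 = l11 * l11 % p
    x_out = (l13 - l12) % p
    l14 = (l8 - x_out) % p
    l15 = l11 * l14 % p
    y_out = (l15 - l10) % p
    return x_out, y_out, z_out

def affine_double(X, Y, p):
    """Reference doubling in affine coordinates (tangent slope with a = -3)."""
    m = (3 * X * X - 3) * pow(2 * Y, -1, p) % p
    X2 = (m * m - 2 * X) % p
    return X2, (m * (X - X2) - Y) % p

def to_affine(x, y, z, p):
    zi = pow(z, -1, p)
    return x * zi * zi % p, y * zi * zi * zi % p

p = 10007                       # small prime, chosen only for the demonstration
X, Y = 5, 7                     # lies on the curve with b = Y^2 - X^3 + 3X mod p
print(to_affine(*jacobian_double(X, Y, 1, p), p), affine_double(X, Y, p))
```

The ten lines of `jacobian_double` correspond to ten time steps for eighteen field operations, which illustrates why the gain from parallel execution stays limited for the standard formulae.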

Similar considerations apply to all the other standard algorithms for addition 
and doubling on elliptic curves in even and large prime characteristic. There is 
generally a set of operations at the beginning of each operation which lend 
themselves to parallel execution followed by a sequence of operations which are 
highly dependent. Overall one can obtain roughly a 50-60 percent improvement 
in performance. 

In the next section we present a special type of elliptic curve which allows 
one to obtain a five fold increase in performance over the standard sequential 
point addition algorithm and a three fold increase over the standard sequential 
point doubling algorithm. 

3 The Hessian Form of an Elliptic Curve 

Let $k = \mathbb{F}_q$ denote a finite field with q a prime power such that $q \equiv 2 \pmod 3$;
we include the case of characteristic two fields. Let E denote an elliptic curve
over k which has a k-rational point of order 3. The restriction on the choice of
k is for two reasons: Firstly, it implies that although we have a point of order 3
we only have two of them rather than a full set of eight. Secondly, it implies that
the construction below is guaranteed to apply.

Note that this constraint on the field and curve means that the recommended 
curves specified in the ANSI and FIPS standards cannot be made into Hessian 
form. However, these constraints are consistent with the recommendations for 
curve parameters specified in the IEEE, ANSI and SECG standards. So the 
Hessian form is still able to be used in standard compliant implementations, one 
just cannot use the recommended curves contained in some standards. 

By moving a point of order three to the origin, we can assume our elliptic
curve has the form

$E : Y^2 + a_1 XY + a_3 Y = X^3$.

This curve has discriminant

$\Delta = a_3^3(a_1^3 - 27a_3) = a_3^3\,\delta$.

The points of order three on E are given by (0, 0) and (0, $-a_3$).

We wish to find a more convenient model for our elliptic curve E. To do
this we perform the following transformations: We let $\mu$ denote a root of the
polynomial

$T^3 - \delta T^2 + \frac{\delta^2}{3} T + a_3 \delta^2 = 0$.

Since $q \equiv 2 \pmod 3$ every element in k has a unique cube root and we can
determine $\mu$ from the formula

$\mu = \frac{1}{3}\left((-27 a_3 \delta^2 - \delta^3)^{1/3} + \delta\right)$.

Now define

$D = \dfrac{3(\mu - \delta)}{\mu}$.

The curve E is then birationally equivalent to the curve

$C : x^3 + y^3 + z^3 = Dxyz$,

which is called the cubic Hessian form [2]. The change of variable to pass from
E to C is given by

$x = \dfrac{a_1(2\mu - \delta)}{3\mu - \delta}\,X + Y + a_3$,

$y = \dfrac{-a_1 \mu}{3\mu - \delta}\,X - Y$,

$z = \dfrac{-a_1 \mu}{3\mu - \delta}\,X - a_3$.
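
The transformation can be tried out over a small prime field. The following sketch is an illustration under our own assumptions (the prime p = 101, the coefficients $a_1 = 1$, $a_3 = 2$ and all function names are arbitrary choices): it computes $\delta$, $\mu$ and D, maps every affine point of E to projective coordinates and checks the Hessian equation. For $p \equiv 2 \pmod 3$ the unique cube root of c is $c^{(2p-1)/3}$.

```python
def cube_root(c, p):
    """Unique cube root in F_p, valid for p % 3 == 2."""
    return pow(c, (2 * p - 1) // 3, p)

def hessian_parameters(a1, a3, p):
    delta = (a1 ** 3 - 27 * a3) % p
    s = cube_root((-27 * a3 * delta ** 2 - delta ** 3) % p, p)   # s = 3*mu - delta
    mu = (s + delta) * pow(3, -1, p) % p
    D = 3 * (mu - delta) * pow(mu, -1, p) % p
    return delta, mu, D

def to_hessian(X, Y, a1, a3, p):
    """Map an affine point of E: Y^2 + a1*X*Y + a3*Y = X^3 to projective (x, y, z) on C."""
    delta, mu, _ = hessian_parameters(a1, a3, p)
    inv = pow((3 * mu - delta) % p, -1, p)
    x = (a1 * (2 * mu - delta) * inv * X + Y + a3) % p
    y = (-a1 * mu * inv * X - Y) % p
    z = (-a1 * mu * inv * X - a3) % p
    return x, y, z

p, a1, a3 = 101, 1, 2            # assumed toy parameters, p % 3 == 2
_, _, D = hessian_parameters(a1, a3, p)
points_E = [(X, Y) for X in range(p) for Y in range(p)
            if (Y * Y + a1 * X * Y + a3 * Y - X ** 3) % p == 0]
all_on_C = all((x ** 3 + y ** 3 + z ** 3 - D * x * y * z) % p == 0
               for x, y, z in (to_hessian(X, Y, a1, a3, p) for X, Y in points_E))
print(len(points_E), all_on_C)
```

Note that the point (0, 0) of order three is mapped to $(a_3, 0, -a_3)$, a projective multiple of (1, 0, -1).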



3.1 Example Curve 

Our example curve will be in characteristic two, defined over the field of $2^{191}$
elements. We shall represent the field by a polynomial basis in t over $\mathbb{F}_2$, where

^191 ^ 

Elements of k we will represent by hexadecimal numbers, prefixed by the string
0x, which correspond to the associated polynomial being thought of as a
polynomial over the integers and then evaluated at 2. For example 0x11 corresponds
to the polynomial $t^4 + 1$.
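
The encoding is easy to reproduce: bit i of the integer is the coefficient of $t^i$. A minimal sketch (the helper names are ours):

```python
def hex_to_exponents(value):
    """Return the exponents of t with coefficient 1, highest first."""
    return [i for i in range(value.bit_length() - 1, -1, -1) if (value >> i) & 1]

def polynomial_string(value):
    terms = ["1" if e == 0 else "t" if e == 1 else f"t^{e}" for e in hex_to_exponents(value)]
    return " + ".join(terms) if terms else "0"

print(polynomial_string(0x11))   # t^4 + 1
print(polynomial_string(0x7))    # t^2 + t + 1
```

Every 48-digit hexadecimal constant in this section whose leading digit is below 8 therefore represents a polynomial of degree at most 190, as required for an element of the field with $2^{191}$ elements.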

Consider the elliptic curve given by

$E' : y^2 + xy = x^3 + x^2 + b$

where

b = 0x4DE3965E00F2A1C6C9750156A6FEFBE5EEF780BF3EF20E48.

This curve has group order

6 · q = 3138550867693340381917894711648254768837315541933943803842,

where q is a prime. A point of order 3 on E' is given by the point $(x_3, y_3)$, where

$x_3$ = 0x4763CFBC4340674B749E57887850E92C9B6BEDF58EEDC3BF,

$y_3$ = 0x14A96A1E53DCC3E73CFB22B80E8658CE0D6D8E82ED2AEC7D.

If we perform the following change of variable

$X = x + x_3$, $Y = y + sx + x_3^2$,

where

$s = \dfrac{x_3^2 + y_3}{x_3}$,

then we obtain an elliptic curve E in the form

$E : Y^2 + XY + x_3 Y = X^3$.

We can now use the above transformation to determine that the Hessian form
C is given by

$x^3 + y^3 + z^3 = Dxyz$,

where

D = 0x16A4C7C2030FAD1380ABF8C2D47DC3E0C20AF62F6EDD06A7.

A point of order q on C is given by (x, y, z), where

x = 0x52FD0CE78D0651B4F66D2F4E12E170CA3E429F6A06433B22,
y = 0x1BECA50368403F3D13173968082B035397C77830A9D90E5D,
z = 0x2B08F7C0CCAC86151AA6FECABDD2D052BD60924F28A6A78E.

4 The Hessian Group Law 

The group law on curves in Hessian form has been known to be particularly
simple for a long time, see [1, Formulary], [2] and [5]. The zero of the group law
on C is given by (1, -1, 0). The two points of order three are given by (0, 1, -1)
and (1, 0, -1). If $P = (x_1, y_1, z_1)$ and $Q = (x_2, y_2, z_2)$, we define

$-P = (y_1, x_1, z_1)$,

$P + Q = (x_3, y_3, z_3)$,

where

$x_3 = y_1^2 x_2 z_2 - y_2^2 x_1 z_1$,

$y_3 = x_1^2 y_2 z_2 - x_2^2 y_1 z_1$,

$z_3 = z_1^2 y_2 x_2 - z_2^2 y_1 x_1$.

The formulae for point doubling are given by $[2]P = (x_2, y_2, z_2)$ where

$x_2 = y_1(z_1^3 - x_1^3)$,

$y_2 = x_1(y_1^3 - z_1^3)$,

$z_2 = z_1(x_1^3 - y_1^3)$.

These formulae apply both in even and large prime characteristic. The addition
formulae require 12 field multiplications, whilst the doubling formulae require
6 field multiplications and three squarings.

We first compare the sequential operation of the above group law with the
usual formulae.

For curves over fields of large prime characteristic, given by the usual model, 
it is best to use a mixed coordinate representation of points, see [4]. In the 
most common case of using Jacobian projective coordinates for doubling and 
mixed Jacobian/affine coordinates for addition we can perform an addition in 
8 multiplications and 3 squarings. A doubling will take 4 multiplications and 4 
squarings. 

In fields of characteristic two, again in the standard model, we can again 
use a mixed coordinate system, but this time using the projective coordinates 
proposed in [7]. One can now perform an addition using 9 multiplications and 4 
squarings. A doubling can be performed in 4 multiplications and 5 squarings. 

So for sequential execution of the multiplication operations the standard 
representation appears to be more efficient. However, the group operations for 
the Hessian form can be performed in a highly parallel way. As before we assume 
that at most three multiprecision multiplications can be performed in parallel. 

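The addition of two points, $P = (x_1, y_1, z_1)$ and $Q = (x_2, y_2, z_2)$, in the
Hessian form can be carried out as follows

$\lambda_1 = y_1 x_2$         $\lambda_2 = x_1 y_2$         $\lambda_3 = x_1 z_2$
$\lambda_4 = z_1 x_2$         $\lambda_5 = z_1 y_2$         $\lambda_6 = z_2 y_1$
$s_1 = \lambda_1 \lambda_6$   $s_2 = \lambda_2 \lambda_3$   $s_3 = \lambda_5 \lambda_4$
$t_1 = \lambda_2 \lambda_5$   $t_2 = \lambda_1 \lambda_4$   $t_3 = \lambda_6 \lambda_3$
$x_3 = s_1 - t_1$             $y_3 = s_2 - t_2$             $z_3 = s_3 - t_3$

The case of the two projective points being equivalent, and the doubling
operation needing to be called, can be detected from the condition

$\lambda_1 = \lambda_2$ and $\lambda_3 = \lambda_4$.

The point doubling operation on the Hessian can also be expressed in a similar
highly parallel way

$\lambda_1 = x_1^2$                  $\lambda_2 = y_1^2$                  $\lambda_3 = z_1^2$
$\lambda_4 = x_1 \lambda_1$          $\lambda_5 = y_1 \lambda_2$          $\lambda_6 = z_1 \lambda_3$
$\lambda_7 = \lambda_5 - \lambda_6$  $\lambda_8 = \lambda_6 - \lambda_4$  $\lambda_9 = \lambda_4 - \lambda_5$
$x_2 = y_1 \lambda_8$                $y_2 = x_1 \lambda_7$                $z_2 = z_1 \lambda_9$.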
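
Both schedules translate directly into code, one source line per row of three independent operations. The sketch below works over a small prime field; the prime p = 101, the parameter D = 6 and the point (2, 3, 1) are toy values chosen by us so that $2^3 + 3^3 + 1 = 6 \cdot 2 \cdot 3$, and they have no cryptographic meaning.

```python
def hessian_add(P, Q, p):
    """Addition on x^3 + y^3 + z^3 = D*x*y*z for distinct points; 12 multiplications in 4 rows."""
    x1, y1, z1 = P
    x2, y2, z2 = Q
    l1, l2, l3 = y1 * x2 % p, x1 * y2 % p, x1 * z2 % p
    l4, l5, l6 = z1 * x2 % p, z1 * y2 % p, z2 * y1 % p
    s1, s2, s3 = l1 * l6 % p, l2 * l3 % p, l5 * l4 % p
    t1, t2, t3 = l2 * l5 % p, l1 * l4 % p, l6 * l3 % p
    return (s1 - t1) % p, (s2 - t2) % p, (s3 - t3) % p

def hessian_double(P, p):
    """Doubling; 3 squarings and 6 multiplications in 3 rows, plus one row of subtractions."""
    x1, y1, z1 = P
    l1, l2, l3 = x1 * x1 % p, y1 * y1 % p, z1 * z1 % p
    l4, l5, l6 = x1 * l1 % p, y1 * l2 % p, z1 * l3 % p
    l7, l8, l9 = (l5 - l6) % p, (l6 - l4) % p, (l4 - l5) % p
    return y1 * l8 % p, x1 * l7 % p, z1 * l9 % p

def on_curve(P, D, p):
    x, y, z = P
    return (x ** 3 + y ** 3 + z ** 3 - D * x * y * z) % p == 0

p, D = 101, 6                   # assumed toy parameters
P = (2, 3, 1)
P2 = hessian_double(P, p)
P3 = hessian_add(P, P2, p)
neutral = hessian_add(P, (P[1], P[0], P[2]), p)   # P + (-P), a multiple of (1, -1, 0)
print(P2, P3, neutral)
```

Feeding the same point twice into `hessian_add` returns (0, 0, 0), which is why the equality test on $\lambda_1, \ldots, \lambda_4$ has to redirect that case to the doubling routine.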



5 FPGA Implementation 

We implemented the example curve above both using the standard representa- 
tion and the Hessian form on a Xilinx4000XL FPGA. The code was written using 
the HandelC language and compiler produced by Embedded Solutions Ltd. This 
converts a C like language into a netlist which is then passed through Xilinx 
tools to produce the final FPGA bitmap. 

The implementation was made to test the relative performance of the two
models and not to maximize the speed. The FPGA implemented a simple point
multiplication algorithm, taking as input a 191 bit integer m and returning the
projective point corresponding to [m]P, where P is the point of order q on the
appropriate models given earlier.

The point multiplication used was the simple right-to-left binary method.
The code for both representations made as much use of the parallel nature of
the double and add routines as could be accommodated.
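
For reference, the right-to-left binary method can be sketched generically as below. This is not the HandelC code of the implementation; the group operations are passed in as functions, and the usage example substitutes addition modulo a small integer for the curve arithmetic so that the result is easy to check.

```python
def scalar_multiply(m, P, add, double, identity):
    """Right-to-left binary method: scan the bits of m from least to most significant."""
    result = identity
    addend = P                       # holds [2^i]P in iteration i
    while m > 0:
        if m & 1:
            result = add(result, addend)
        addend = double(addend)      # independent of the addition above
        m >>= 1
    return result

# Stand-in group for the demonstration: integers modulo n under addition.
n = 1000003
m = (1 << 190) + 12345               # a 191-bit scalar
product = scalar_multiply(m, 5, lambda a, b: (a + b) % n, lambda a: 2 * a % n, 0)
print(product == m * 5 % n)
```

In the right-to-left variant the addition and the doubling of one iteration do not depend on each other, which is one more place where parallel hardware could be used.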

The following timings were obtained as an average over a number of calls to
the FPGA:

Form of Curve            Point Multiplication Time
Standard (Sequential)    77.238 ms
Standard (Parallel)      17.711 ms
Hessian                  11.821 ms



So we see that the Hessian form gives around a 40 percent performance improve- 
ment over the standard form, even when we exploit all the inherent parallelism 
in the standard formulae. 



6 Conclusion 

We have shown that using the Hessian form of an elliptic curve allows us to 
implement the point addition and point doubling operation in a highly parallel 
way. This can be exploited by a number of processor architectures such as those
which have a SIMD style instruction set, those which have multiple ALUs or
those which have a pipelined finite field multiplier.

We have shown, by implementing a demonstration example on an FPGA in 
characteristic two, that the Hessian form does in fact give a significant perfor- 
mance improvement in real life. 

Abstract. We present a scalar multiplication algorithm with recovery
of the y-coordinate on a Montgomery form elliptic curve over any
non-binary field.

The previous algorithms for scalar multiplication on a Montgomery form 
do not consider how to recover the y-coordinate. So although they can 
be applicable to certain restricted schemes (e.g. ECDH and ECDSA-S), 
some schemes (e.g. ECDSA-V and MQV) require scalar multiplication 
with recovery of the y-coordinate. 

We compare our proposed scalar multiplication algorithm with the tra- 
ditional scalar multiplication algorithms (including Window-methods in 
Weierstrass form), and discuss the Montgomery form versus the Weier- 
strass form in the performance of implementations with several tech- 
niques of elliptic curve cryptosystems (including ECES, ECDSA, and 
ECMQV). Our results clarify the advantage of the cryptographic usage 
of Montgomery-form elliptic curves in constrained environments such as 
mobile devices and smart cards. 

Keywords: Elliptic Curve Cryptosystem, Montgomery form, Fast Scalar 
Multiplication, y-coordinate recovery 



1 Introduction 

Lim and Hwang state the following problem [LH00, page 409]: “Montgomery’s
method is not a general algorithm for elliptic scalar multiplication in $GF(p^n)$,
since it can’t compute the y-coordinate of $kP$.” This paper completely solves this
problem, and shows that Montgomery’s method is indeed a general algorithm.

1.1 Montgomery-Form Elliptic Curves

Montgomery introduced the non-standard form $BY^2 = X^3 + AX^2 + X$
for elliptic curves in [Mon87], while the most standard form of elliptic curves
is $y^2 = x^3 + ax + b$, which is called the (short) Weierstrass form. While
investigating efficient scalar multiplication algorithms on elliptic curves, some
researchers [LD99,LH00,Kur98,OKS00] have independently observed that
Montgomery’s method [Mon87] has an advantage in preventing timing attacks
[Koc,Koc96].^1

1.2 Elliptic Curve Cryptosystems without the y-Coordinate 

The scalar multiplication algorithm on a Montgomery-form elliptic curve is fast,
because it requires information on the x-coordinate only. Recall that previously
proposed algorithms on a Montgomery form consider only the x-coordinate of
$kP$, the $k$-fold scalar multiple of the point $P$ on the elliptic curve. This
is enough for application to some elliptic curve cryptosystems, including the
key-establishment scheme ECDH and signature generation ECDSA-S [IEEEp1363].

However, we should note that the ECDSA verifying algorithm cannot be
executed without referencing the y-coordinate of $kP$. Also ECSVDP-MQV,
ECSVDP-MQVC, ECVP-NR, and ECVP-DSA, which are described in the draft
standard IEEE P1363 [IEEEp1363], need a scalar multiplication with recovery
of the y-coordinate.

1.3 Montgomery Form with Recovery of the y-Coordinate 

Recently, López and Dahab [LD99] extended the idea of Montgomery’s method
[Mon87] to binary fields (i.e. $\mathbb{F}_{2^m}$), and developed an algorithm for a scalar
multiplication with recovery of the y-coordinate.

However, their algorithm is valid only over binary fields, and designing an
efficient algorithm for recovering a y-coordinate in the Montgomery form over
non-binary fields has remained open.

1.4 Our Contributions 

We present a scalar multiplication algorithm with recovery of the y-coordinate 
on a Montgomery-form elliptic curve over non-binary fields. 

We further compare our proposed scalar multiplication algorithm with the 
traditional scalar multiplication algorithms (including Window-methods on the 
Weierstrass form), and discuss the cryptographic advantage of using the Mont- 
gomery-form elliptic curve. 

Our analysis shows that the scalar multiplication on a Montgomery-form
elliptic curve, which requires no precomputation, is faster than that of the
window method on a Weierstrass-form elliptic curve, if the size of the definition
field is smaller than 391 bits in a reasonable implementation with $(S/M) = 0.8$
and $(I/M) = 30$ [LH00]. Recall that elliptic curve cryptosystems over prime fields
whose sizes are larger than 272 bits are believed to be secure until the year 2050
even if some cryptanalytic developments occur [LV00].

^1 It was shown that a Montgomery-form elliptic curve is effective for preventing
differential power analysis [OS00].

We further consider the Montgomery form versus the Weierstrass form in the 
performance of implementations with several techniques of elliptic curve cryp- 
tosystems (including ECES, ECDSA, and ECMQV). We compare the amounts 
of computation and storage of our Montgomery method with recovery of the 
y-coordinate to those of the Weierstrass window methods (with simultaneous 
techniques) for point multiplication $kP$ (resp. for $kP + Q$ and for $kP + lQ$). Our
results suggest new advantages of the Montgomery form over the Weierstrass
form in the implementation of elliptic curve cryptosystems.

2 Elliptic Curve Schemes Using the y-Coordinate

The elliptic curve signature scheme ECDSA cannot be executed without
referencing the y-coordinate of the point $kP$, the $k$-fold scalar multiple of
the point $P$. This is in contrast to some elliptic curve schemes (e.g. ECDH),
which can be executed without the y-coordinate. The verifying algorithm of
ECDSA requires the operation $kP + k'Q$. To put it simply, the computation
requires the addition of a scalar-multiplied point and another point. Addition of
points, without using their differences, on a Montgomery-form elliptic curve
requires the y-coordinates of the points. That is the reason why ECDSA requires
the y-coordinate of $kP$.

The same applies to ECSVDP-MQV, ECSVDP-MQVC, ECVP-NR, and
ECVP-DSA, which are described in the draft standard IEEE P1363 [IEEEp1363].
These schemes also require the operation $kP + Q$, which is an addition of a
scalar-multiplied point and another point.

We would like to emphasize that recovering the y-coordinate completely 
solves these problems. Therefore, Montgomery’s method of scalar multiplication 
becomes a general algorithm. 

3 Recovering the y-Coordinate 

Research has studied recovery of the y-coordinate on a Montgomery-like scalar 
multiplication method in the case of an elliptic curve defined over a finite field 
with characteristic 2 [LD99]. However, no similar algorithm is known in the case 
of a Montgomery-form elliptic curve [Mon87] defined over a prime field (nor over 
an OEF [BP98]). First, we will take up the case of characteristic 2, and then we 
will focus our attention on the case of a Montgomery-form elliptic curve. 

3.1 Recovering the y-Coordinate (p = 2) [LD99]

Let $\mathbb{F}_{2^m}$ be a finite field with characteristic 2. A non-supersingular elliptic curve
over $\mathbb{F}_{2^m}$ is defined as follows.

$$y^2 + xy = x^3 + ax^2 + b,$$

where $a, b \in \mathbb{F}_{2^m}$ and $b \neq 0$.

Theorem 1 ([LD99]). Let $P = (x, y)$, $P_1 = (x_1, y_1)$, $P_2 = (x_2, y_2)$ be points on
the elliptic curve. Assume that $P_2 = P_1 + P$ and $x \neq 0$. Then

$$y_1 = (x_1 + x)\{(x_1 + x)(x_2 + x) + x^2 + y\}/x + y.$$

3.2 Recovering the y-Coordinate (p > 3)

Let $p$ be a prime and $\mathbb{F}_{p^m}$ be a finite field with characteristic $p$. A
Montgomery-form elliptic curve over $\mathbb{F}_{p^m}$ is defined as follows.

$$BY^2 = X^3 + AX^2 + X,$$

where $A, B \in \mathbb{F}_{p^m}$ and $B(A^2 - 4) \neq 0$.

First, we construct a method for recovering the y-coordinate on a
Montgomery-form elliptic curve in a similar fashion to the method on the elliptic
curve over $\mathbb{F}_{2^m}$. The proofs of the propositions in this section are in the Appendix.

Theorem 2. Let $P = (x, y)$, $P_1 = (x_1, y_1)$, $P_2 = (x_2, y_2)$ be points on a
Montgomery-form elliptic curve. Assume that $P_2 = P_1 + P$ and $y \neq 0$. Then

$$y_1 = \frac{(x_1 x + 1)(x_1 + x + 2A) - 2A - (x_1 - x)^2 x_2}{2By}.$$

Corollary 1. Let $P$, $P_1$ and $P_2$ be as in Theorem 2. We express
$P_1 = (X_1/Z_1, Y_1/Z_1)$, $P_2 = (X_2/Z_2, Y_2/Z_2)$, and define $X_1^{rec}$, $Y_1^{rec}$, $Z_1^{rec}$,
which are given by the following:

$$X_1^{rec} = 2ByZ_1Z_2X_1$$
$$Y_1^{rec} = Z_2[(X_1 + xZ_1 + 2AZ_1)(X_1x + Z_1) - 2AZ_1^2] - (X_1 - xZ_1)^2X_2$$
$$Z_1^{rec} = 2ByZ_1Z_2Z_1$$

Then the relation $(X_1^{rec}, Y_1^{rec}, Z_1^{rec}) = (X_1, Y_1, Z_1)$ holds in projective
coordinates.

This method for recovering the Y-coordinate needs twelve multiplications
and one squaring. An example of the algorithm using Corollary 1 is
the following Algorithm 1.

Algorithm 1: Algorithm for recovering the Y-coordinate
INPUT $x, y, X_1, Z_1, X_2, Z_2$
OUTPUT $X_1^{rec}, Y_1^{rec}, Z_1^{rec}$
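
The formulas of Theorem 2 and Corollary 1 can be checked numerically. The following Python sketch does so on a toy curve; the prime p = 101, the coefficients A = 3, B = 1, the scaling factors, and the helper names `inv`, `add`, `recover_projective` and `check_all` are assumptions made for illustration and do not come from the paper.

```python
# Sketch only: toy parameters chosen for illustration, not taken from the paper.
p, A, B = 101, 3, 1          # B*(A^2 - 4) != 0 mod p, so the curve is nonsingular

def inv(a):
    """Inverse in F_p via Fermat's little theorem (a must be nonzero mod p)."""
    return pow(a % p, p - 2, p)

def on_curve(x, y):
    return (B * y * y - (x ** 3 + A * x * x + x)) % p == 0

def add(P1, P):
    """Affine addition on B*y^2 = x^3 + A*x^2 + x for points with distinct x."""
    (x1, y1), (x, y) = P1, P
    lam = (y1 - y) * inv(x1 - x) % p
    x2 = (B * lam * lam - A - x1 - x) % p
    y2 = (lam * (x1 - x2) - y1) % p
    return x2, y2

def recover_projective(x, y, X1, Z1, X2, Z2):
    """Corollary 1: projective (X, Y, Z) of P1 from P = (x, y) and x-only data."""
    Xr = 2 * B * y * Z1 * Z2 * X1 % p
    Yr = (Z2 * ((X1 + x * Z1 + 2 * A * Z1) * (X1 * x + Z1) - 2 * A * Z1 * Z1)
          - (X1 - x * Z1) ** 2 * X2) % p
    Zr = 2 * B * y * Z1 * Z2 * Z1 % p
    return Xr, Yr, Zr

def check_all():
    """Compare the recovered point with the true P1 for every admissible pair."""
    pts = [(x, y) for x in range(p) for y in range(1, p) if on_curve(x, y)]
    checked = ok = 0
    for P in pts:
        for P1 in pts:
            if P1[0] == P[0]:
                continue                      # need P1 != +-P
            x2, _ = add(P1, P)                # only x(P1 + P) is used below
            Z1, Z2 = 7, 11                    # arbitrary nonzero projective scalings
            Xr, Yr, Zr = recover_projective(
                P[0], P[1], P1[0] * Z1 % p, Z1, x2 * Z2 % p, Z2)
            checked += 1
            ok += (Xr * inv(Zr) % p, Yr * inv(Zr) % p) == P1
    return checked, ok
```

Calling `check_all()` returns the number of point pairs tested and the number for which the recovered affine point equals $P_1$; the two counts should agree.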

Next, we propose another method for recovering the y-coordinate on a
Montgomery-form elliptic curve.

Theorem 3. Let $P = (x, y)$, $P_1 = (x_1, y_1)$, $P_2 = (x_2, y_2)$, $P_3 = (x_3, y_3)$ be points
on a Montgomery-form elliptic curve. Assume that $P_2 = P_1 + P$, $P_3 = P_1 - P$
and $y \neq 0$. Then

$$y_1 = \frac{(x_3 - x_2)(x_1 - x)^2}{4By}.$$

Corollary 2. Let $P$, $P_1$, $P_2$ and $P_3$ be as in Theorem 3. We express
$P_1 = (X_1/Z_1, Y_1/Z_1)$, $P_2 = (X_2/Z_2, Y_2/Z_2)$, $P_3 = (X_3/Z_3, Y_3/Z_3)$, and define
$X_1^{rec}$, $Y_1^{rec}$, $Z_1^{rec}$, which are given by the following:

$$X_1^{rec} = 4ByZ_1Z_2Z_3X_1$$
$$Y_1^{rec} = (X_3Z_2 - Z_3X_2)(X_1 - Z_1x)^2$$
$$Z_1^{rec} = 4ByZ_1Z_2Z_3Z_1$$

Then $(X_1^{rec}, Y_1^{rec}, Z_1^{rec}) = (X_1, Y_1, Z_1)$ in projective coordinates.

This method for recovering the Y-coordinate in projective coordinates needs
ten multiplications and one squaring. Thus, this method is faster
than that using Corollary 1. However, we have to compute the point $P_3$ before
we use this method if $P_3$ is not given.
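
Theorem 3 can be checked in the same spirit. The sketch below evaluates the affine formula on a toy curve; p = 101, A = 3, B = 1 and the helper names are illustrative assumptions, not values from the paper.

```python
# Sketch only: toy parameters chosen for illustration, not taken from the paper.
p, A, B = 101, 3, 1

def inv(a):
    """Inverse in F_p via Fermat's little theorem (a must be nonzero mod p)."""
    return pow(a % p, p - 2, p)

def add(P1, P):
    """Affine addition on B*y^2 = x^3 + A*x^2 + x for points with distinct x."""
    (x1, y1), (x, y) = P1, P
    lam = (y1 - y) * inv(x1 - x) % p
    x2 = (B * lam * lam - A - x1 - x) % p
    return x2, (lam * (x1 - x2) - y1) % p

def recover_y1(x, y, x1, x2, x3):
    """Theorem 3: y1 from x(P1), x(P1 + P), x(P1 - P) and P = (x, y), y != 0."""
    return (x3 - x2) * (x1 - x) ** 2 * inv(4 * B * y) % p

def check_all():
    pts = [(x, y) for x in range(p) for y in range(1, p)
           if (B * y * y - (x ** 3 + A * x * x + x)) % p == 0]
    checked = ok = 0
    for (x, y) in pts:
        for (x1, y1) in pts:
            if x1 == x:
                continue                              # need P1 != +-P
            x2 = add((x1, y1), (x, y))[0]             # x(P1 + P)
            x3 = add((x1, y1), (x, (-y) % p))[0]      # x(P1 - P), using -P = (x, -y)
            checked += 1
            ok += recover_y1(x, y, x1, x2, x3) == y1
    return checked, ok
```

As before, `check_all()` returns the number of pairs tested and the number of matches.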

Proposition 1. Let $P = (x, y)$ be a point on a Montgomery-form elliptic curve,
and $kP = (x_k, y_k)$ for $k = 1, 2, \ldots$ be the scalar-multiplied point of the point $P$.
For any $X_m/Z_m = x_m$, $X_n/Z_n = x_n$ and $X_{m+n}/Z_{m+n} = x_{m+n}$ $(m > n)$, we define
$X'$ and $Z'$, which are given by the following:

$$X' = Z_{m+n}[(X_m - Z_m)(X_n + Z_n) + (X_m + Z_m)(X_n - Z_n)]^2$$
$$Z' = X_{m+n}[(X_m - Z_m)(X_n + Z_n) - (X_m + Z_m)(X_n - Z_n)]^2$$

Then $X'/Z' = x_{m-n}$ is satisfied.

The calculation of $X'$ and $Z'$ needs four multiplications and two
squarings. Thus, recovering the Y-coordinate in projective coordinates without
$P_3$ requires fourteen multiplications and three squarings.
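
Proposition 1 is the Montgomery differential-addition rule read backwards: from the x-coordinates of two points and of their sum it yields the x-coordinate of their difference. The sketch below tests it for arbitrary pairs of points Q, R standing in for mP and nP (the identity only involves Q, R and Q + R). The toy parameters, the scaling factors and all function names are assumptions for illustration.

```python
# Sketch only: toy parameters chosen for illustration, not taken from the paper.
p, A, B = 101, 3, 1

def inv(a):
    return pow(a % p, p - 2, p)

def add(Q, R):
    """Affine addition on B*y^2 = x^3 + A*x^2 + x, valid when x(Q) != x(R)."""
    (xq, yq), (xr, yr) = Q, R
    lam = (yq - yr) * inv(xq - xr) % p
    xs = (B * lam * lam - A - xq - xr) % p
    return xs, (lam * (xq - xs) - yq) % p

def x_of_difference(Xm, Zm, Xn, Zn, Xs, Zs):
    """Proposition 1: projective x-coordinate (X', Z') of the difference point."""
    a = ((Xm - Zm) * (Xn + Zn) + (Xm + Zm) * (Xn - Zn)) % p
    b = ((Xm - Zm) * (Xn + Zn) - (Xm + Zm) * (Xn - Zn)) % p
    return Zs * a * a % p, Xs * b * b % p

def check_all():
    pts = [(x, y) for x in range(p) for y in range(p)
           if (B * y * y - (x ** 3 + A * x * x + x)) % p == 0]
    checked = ok = 0
    for Q in pts:
        for R in pts:
            if Q[0] == R[0]:
                continue
            xs = add(Q, R)[0]                      # x(Q + R)
            xd = add(Q, (R[0], -R[1] % p))[0]      # x(Q - R), the expected answer
            if xs == 0:
                continue                           # Z' would vanish
            zm, zn, zs = 2, 3, 5                   # arbitrary nonzero scalings
            Xp, Zp = x_of_difference(Q[0] * zm % p, zm,
                                     R[0] * zn % p, zn,
                                     xs * zs % p, zs)
            checked += 1
            ok += Zp != 0 and (Xp - xd * Zp) % p == 0
    return checked, ok
```

The comparison is done by cross-multiplication ($X' = x_{m-n} Z'$), so no inversion of $Z'$ is needed.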

We should notice that a combination of these formulae reduces the
computation amount. Let us examine the computation amount of the latter method in
more detail. We have $x_d = X_d/Z_d$, $x_{d+1} = X_{d+1}/Z_{d+1}$, $x_{d-1} = X_{d-1}/Z_{d-1}$, and we
transform the formula for $y_1$ in Theorem 3 into the formula in projective
coordinates by substituting these values. Then we obtain the following equation:

$$y_d = \frac{(X_{d-1}Z_{d+1} - Z_{d-1}X_{d+1})(X_d - Z_d x)^2}{4ByZ_{d-1}Z_{d+1}Z_d^2}.$$

Proposition 1 enables us to eliminate $X_{d-1}$ and $Z_{d-1}$ from the equation
above. We obtain the following equation with the settings $X_1 = x$ and $Z_1 = 1$:

$$y_d = \frac{\{Z_{d+1}U + X_{d+1}V\}\{Z_{d+1}U - X_{d+1}V\}U^2}{ByZ_{d+1}X_{d+1}V^2Z_d^2},$$

where $U = X_d x - Z_d$, $V = X_d - xZ_d$. On the other hand, we have $x_d = X_d/Z_d$. We
obtain

$$x_d = \frac{ByZ_{d+1}X_{d+1}Z_dV^2X_d}{ByZ_{d+1}X_{d+1}Z_dV^2Z_d}$$

by reducing to the denominator of $y_d$, which reduces the number of
inversions.

$M$, $S$ and $I$ respectively denote the $\mathbb{F}_{p^m}$-operations of multiplication, squaring,
and inversion. The method for recovering the y-coordinate in affine coordinates
needs the computation amount $15M + 2S + I$.

On the other hand, we set

$$X_d^{rec} = ByZ_{d+1}X_{d+1}Z_dV^2X_d$$
$$Y_d^{rec} = \{Z_{d+1}U + X_{d+1}V\}\{Z_{d+1}U - X_{d+1}V\}U^2$$
$$Z_d^{rec} = ByZ_{d+1}X_{d+1}Z_dV^2Z_d$$

Then the relation $(X_d^{rec}, Y_d^{rec}, Z_d^{rec}) = (X_d, Y_d, Z_d)$ holds in projective
coordinates. The method for recovering the y-coordinate needs the computation
amount $13M + 2S$. Therefore, the computation amount shrinks by one
multiplication and one squaring. An example of the algorithm using this method is
the following:

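Algorithm 2: Algorithm for recovering the Y-coordinate

INPUT $x, y, X_d, Z_d, X_{d+1}, Z_{d+1}$
OUTPUT $X_d^{rec}, Y_d^{rec}, Z_d^{rec}$

 1. $T_1 \leftarrow X_d \times x$               11. $T_2 \leftarrow T_2 \times Z_{d+1}$
 2. $T_1 \leftarrow T_1 - Z_d$                  12. $T_2 \leftarrow T_2 \times y$
 3. $T_2 \leftarrow Z_d \times x$               13. $T_2 \leftarrow T_2 \times B$
 4. $T_2 \leftarrow X_d - T_2$                  14. $X_d^{rec} \leftarrow T_2 \times X_d$
 5. $T_3 \leftarrow Z_{d+1} \times T_1$         15. $Z_d^{rec} \leftarrow T_2 \times Z_d$
 6. $T_4 \leftarrow X_{d+1} \times T_2$         16. $T_2 \leftarrow T_3 + T_4$
 7. $T_1 \leftarrow T_1 \times T_1$             17. $T_3 \leftarrow T_3 - T_4$
 8. $T_2 \leftarrow T_2 \times T_2$             18. $T_1 \leftarrow T_1 \times T_2$
 9. $T_2 \leftarrow T_2 \times Z_d$             19. $Y_d^{rec} \leftarrow T_1 \times T_3$
10. $T_2 \leftarrow T_2 \times X_{d+1}$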

In summary, Algorithm 1 needs the computation amount $12M + S$, and
Algorithm 2 needs the computation amount $13M + 2S$. Therefore, Algorithm 1
is faster than Algorithm 2 by one multiplication and one squaring.

4 Montgomery Form versus Weierstrass Form

It was pointed out in the previous section that we are able to efficiently
recover the y-coordinate of a scalar-multiplied point on a Montgomery-form
elliptic curve. If we combine the traditional scalar multiplication algorithm with
the algorithm for recovering the y-coordinate, we obtain a fast scalar
multiplication algorithm on a Montgomery-form elliptic curve which gives all the
coordinates of a scalar-multiplied point.

$T^{Mon}(l)$, $T_x^{Mon}(l)$ and $T_{rec}^{Mon}$ denote the computation amount for the
proposed scalar multiplication algorithm with $l$, the size of a scalar value,^2 that for
the traditional scalar multiplication on a Montgomery-form elliptic curve, and
that for recovery of the y-coordinate, respectively. Then we have the following:

$$T_x^{Mon}(l) = (6l - 3)M + (4l - 2)S$$
$$T_{rec}^{Mon} = 12M + S$$
$$T^{Mon}(l) = T_x^{Mon}(l) + T_{rec}^{Mon} = (6l + 9)M + (4l - 1)S$$



We should recall that, among scalar multiplication algorithms over a
Weierstrass-form elliptic curve, the algorithm using the window method in the mixed
modified Jacobian coordinates is one of the fastest [CMO98]. $T^{Wei}(w, l)$ denotes the
computation amount with window width $w$ and size $l$. According to [LH00] (and
also see [CMO98]), $T^{Wei}(w, l)$ is estimated as follows.

$$T^{Wei}(w, l) = wI + \left(\frac{8l}{w+2} + 4l + 5 \cdot 2^{w-1} - 2w - 14\right)M + \left(\frac{3l}{w+2} + 4l + 2^{w-1} - 2w - 2\right)S$$
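
The two cost estimates are easy to tabulate. The sketch below is a minimal model of $T^{Mon}(l)$ and $T^{Wei}(w, l)$; the function names are hypothetical, the ratio $(S/M) = 0.8$ is the assumption used throughout the paper, and rounding the Weierstrass counts up to whole operations is an assumption that happens to match the 160-bit figures quoted in Section 5.

```python
from fractions import Fraction
from math import ceil

def t_mon(l):
    """(M, S) counts for the Montgomery method with y-recovery (Algorithm 1)."""
    return 6 * l + 9, 4 * l - 1

def t_wei(w, l):
    """(M, S, I) counts for the window method of width w, before rounding."""
    m = Fraction(8 * l, w + 2) + 4 * l + 5 * 2 ** (w - 1) - 2 * w - 14
    s = Fraction(3 * l, w + 2) + 4 * l + 2 ** (w - 1) - 2 * w - 2
    return m, s, w

def in_mults(m, s, s_over_m=Fraction(4, 5)):
    """Express m multiplications and s squarings as multiplications only."""
    return m + s * s_over_m

if __name__ == "__main__":
    print(t_mon(160), float(in_mults(*t_mon(160))))
    for w in (4, 5):
        m, s, i = t_wei(w, 160)
        print(w, ceil(m), ceil(s), i, float(in_mults(ceil(m), ceil(s))))
```

Exact rational arithmetic (`Fraction`) is used so that the 0.8 weighting does not introduce floating-point noise.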

First, we compare the computation amount for the Montgomery-form elliptic
curve to that for the Weierstrass-form elliptic curve with a window width of 4
in terms of size $l$. Assume that $T^{Mon}(l) > T^{Wei}(4, l)$; then we obtain

$$(I/M) < \left(\frac{1}{6}l - \frac{9}{4}\right) + \left(-\frac{1}{8}l + \frac{1}{4}\right)(S/M).$$

For simplicity, we write $s$ and $t$ for $(S/M)$ and $(I/M)$, respectively. Then the
inequality above is expressed by

$$l > \frac{24t - 6s + 54}{4 - 3s}.$$

For example, we set $s = 0.8$, and compute $l$ for $t = 10, 20, 30, 40$ and 50. We
obtain $l > 181, 331, 481, 631$ and 781, respectively. Thus, in the case that the size
is smaller than the bits above, the scalar multiplication on a Montgomery-form
elliptic curve is faster than that on a Weierstrass-form elliptic curve.
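
For readers who want the intermediate steps, the bound on $l$ can be derived as follows. This is a worked sketch that substitutes $w = 4$ into the estimate for $T^{Wei}(w, l)$ and divides by $M$, writing $s = (S/M)$ and $t = (I/M)$; it assumes $4 - 3s > 0$.

```latex
\begin{align*}
T^{Mon}(l) &> T^{Wei}(4, l) \\
(6l + 9)M + (4l - 1)S &> 4I + \left(\tfrac{16}{3}l + 18\right)M + \left(\tfrac{9}{2}l - 2\right)S \\
4t &< \left(\tfrac{2}{3}l - 9\right) + \left(-\tfrac{1}{2}l + 1\right)s \\
24t &< (4 - 3s)\,l + 6s - 54 \\
l &> \frac{24t - 6s + 54}{4 - 3s}
\end{align*}
```

With $s = 0.8$ and $t = 30$ the right-hand side is $769.2/1.6 = 480.75$, which gives the border size 481.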



^2 Actually, the size of a scalar value is approximately the size of the definition field.

Second, we compare the computation amount of a Montgomery-form elliptic
curve to that of a Weierstrass-form elliptic curve with a window width of 5 in
terms of size $l$. Assume that $T^{Mon}(l) > T^{Wei}(5, l)$; we obtain

$$l > \frac{35t + 35s + 329}{6 - 3s}.$$

Finally, we compare the computation amount for the Weierstrass-form elliptic
curve with window width 4 to that with window width 5 in terms of size $l$.
Assume that $T^{Wei}(4, l) > T^{Wei}(5, l)$; we obtain

$$l > \frac{42t + 252s + 1596}{8 + 3s}.$$

To sum up the comparison between the computation amounts in terms of size
$l$ we have seen thus far, we obtain Table 1.

According to [LV00], elliptic curve cryptosystems over prime fields whose
sizes are larger than 272 bits are believed to be secure until the year 2050, even
if some cryptanalytic developments occur. Under a reasonable assumption that
$(S/M) = 0.8$ and $(I/M) = 30$ for prime fields [LH00], we see from Table 1
that the scalar multiplication with $w = 5$ is faster than that with $w = 4$ if the
size is larger than 295 bits. Thus, comparing against the better window width in
each range, the scalar multiplication on a Montgomery-form elliptic curve is faster
than that on a Weierstrass-form elliptic curve if the
size is smaller than 391 bits. From this viewpoint, one may say that the scalar
multiplication algorithm on a Montgomery-form elliptic curve is faster than that
on a Weierstrass-form elliptic curve for cryptographic use.
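
The three closed-form bounds can be evaluated directly to reproduce border sizes such as 295, 481 and 391. In the sketch below the function names are hypothetical, and the rounding rule (smallest integer strictly above the bound) is an assumption inferred from the tabulated values rather than stated in the paper.

```python
from fractions import Fraction
from math import floor

def border_w4_mon(s, t):
    """Size above which the window method (w = 4) beats the Montgomery method."""
    return (24 * t - 6 * s + 54) / (4 - 3 * s)

def border_w5_mon(s, t):
    """Size above which the window method (w = 5) beats the Montgomery method."""
    return (35 * t + 35 * s + 329) / (6 - 3 * s)

def border_w5_w4(s, t):
    """Size above which window width 5 beats window width 4."""
    return (42 * t + 252 * s + 1596) / (8 + 3 * s)

def smallest_size_above(bound):
    """Smallest integer strictly greater than the bound (assumed rounding rule)."""
    return floor(bound) + 1

if __name__ == "__main__":
    s = Fraction(4, 5)                      # (S/M) = 0.8
    for t in (10, 20, 30, 40, 50):          # (I/M)
        print(t,
              smallest_size_above(border_w5_w4(s, t)),
              smallest_size_above(border_w4_mon(s, t)),
              smallest_size_above(border_w5_mon(s, t)))
```

Passing `s` as a `Fraction` keeps the arithmetic exact, which matters when a bound is itself an integer (for example, $s = 0.8$, $t = 30$ gives exactly 294 for the $w = 5$ versus $w = 4$ comparison).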



5 Comparison of Our Proposed Method to Other Methods

In this section, we use the computation amount to compare our proposed method,
that is, the scalar multiplication algorithm with recovery of the y-coordinate on a
Montgomery-form elliptic curve, with other methods for several schemes (ECES,
ECElGamal, ECDSA-V and ECSVDP-MQV) [ANSI,IEEEp1363,SEC-1]. For a
comparison of methods to compute a scalar multiplication and the ECDSA scheme
for elliptic curves over $\mathbb{F}_{2^m}$, see [HHM00].

5.1 Elliptic Curve Encryption Scheme

An elliptic curve encryption scheme (ECES)^3 needs to compute the operation
$kP$. We select Montgomery’s method with/without recovery of the y-coordinate
and the window method with the width $w = 4, 5$ from the methods to compute
$kP$, and compare the computation amounts for the size 160 bits.

To compute $kP$ by Montgomery’s method with (resp. without) recovery of the
y-coordinate, the computation amount is $1480.2M$ (resp. $1467.4M$),
assuming $(S/M) = 0.8$.

^3 For further details about ECES, see [IEEEp1363] for example.

Table 1. Border key sizes between scalar multiplication methods

(S/M)   (I/M)   w = 5/w = 4   w = 4/Mon   w = 5/Mon
1.0     50      359           1249        705
        40      321           1009        589
        30      283           769         472
        20      245           529         355
        10      207           289         239
0.9     50      367           961         640
        40      328           776         534
        30      289           592         428
        20      249           407         322
        10      210           223         216
0.8     50      375           781         586
        40      335           631         489
        30      295           481         391
        20      254           331         294
        10      214           181         197
0.7     50      384           658         540
        40      342           532         450
        30      301           406         360
        20      259           279         271
        10      218           153         181
0.6     50      393           569         501
        40      350           460         417
        30      307           351         334
        20      265           242         251
        10      222           133         167
0.5     50      403           501         466
        40      359           405         389
        30      314           309         311
        20      270           213         233
        10      226           117         155

“w = 4/Mon” indicates the border key sizes between the scalar multiplication on a
Weierstrass-form elliptic curve using the window method with a window width of 4
and that on a Montgomery-form elliptic curve. That is, the former is faster than the
latter if the size is larger than the border size. “w = 5/w = 4” and “w = 5/Mon” are
similar in meaning to “w = 4/Mon”.



To compute $kP$ by the window method with the width $w = 4$ (resp. $w = 5$),
the computation amount $1446.4M + 4I$ (resp. $1449.4M + 5I$) is needed.



Table 2. Computation amounts for point multiplication kP

Method                             Points stored   # of M    # of I
Montgomery (traditional)           0               1467.4    0
Montgomery (with recovery of y)    0               1480.2    0
Window Method (w = 4)              7               1446.4    4
Window Method (w = 5)              15              1449.4    5

5.2 Elliptic Curve ElGamal Encryption Scheme

An elliptic curve ElGamal encryption scheme^4 needs to compute the operation
$kP + Q$. We select Montgomery’s method with recovery of the y-coordinate and
the window method with the width $w = 4, 5$ from the methods to compute
$kP + Q$, and compare the computation amounts for the size 160 bits.

^4 For further details about the elliptic curve ElGamal encryption scheme, see
[IEEEp1363] for example.

To compute $kP + Q$ by Montgomery’s method with recovery of the y-coordinate,
the following operations are needed:

(a) The scalar multiplication $kP$ by Montgomery’s method.

(b) The addition $kP + Q$.

(a) Since the size of $k$ is 160 bits, the computation amount is
$969M + 639S$, which includes the computation amount for recovering
the y-coordinate.

(b) The computation amount is $11M + 2S$ for the addition
in the projective coordinates on a Montgomery-form elliptic curve.

Thus, computing $kP + Q$ by Montgomery’s method with recovery of the
y-coordinate needs the computation amount $1492.8M$, assuming $(S/M) = 0.8$.

To compute $kP + Q$ by the window method with the width $w = 4, 5$, the
following operations are needed:

(c) The scalar multiplication $kP$ by the window method.

(d) The addition $kP + Q$.

(c) The computation amount is $872M + 718S + 4I$ if $w = 4$, and it is
$879M + 713S + 5I$ if $w = 5$.

(d) The computation amount is $8M + 3S$ by using the addition formulae of
$J + A \to J$, where $J$ denotes the Jacobian coordinates and $A$ the affine coordinates.

Thus, to compute $kP + Q$ by the window method, the computation amount is
$1456.8M + 4I$ if $w = 4$, or $1459.8M + 5I$ if $w = 5$.
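
The totals above are plain sums of the per-step costs, with squarings weighted by $(S/M) = 0.8$. A small tally helper makes this explicit; the helper name `total_in_M` and the list layout are hypothetical, while the step costs are the ones quoted in items (a) to (d).

```python
from fractions import Fraction

S_OVER_M = Fraction(4, 5)   # the paper's assumption (S/M) = 0.8

def total_in_M(steps, s_over_m=S_OVER_M):
    """Sum (M, S) pairs and express the result in multiplications only."""
    m = sum(step[0] for step in steps)
    s = sum(step[1] for step in steps)
    return m + s * s_over_m

# (M, S) per step; inversions are tracked separately in the text.
montgomery = [(969, 639), (11, 2)]     # (a) kP with y-recovery, (b) addition
window_w4 = [(872, 718), (8, 3)]       # (c) window method, (d) addition J + A -> J
window_w5 = [(879, 713), (8, 3)]

if __name__ == "__main__":
    for name, steps in [("Montgomery", montgomery),
                        ("Window w=4", window_w4),
                        ("Window w=5", window_w5)]:
        print(name, float(total_in_M(steps)))
```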



Table 3. Computation amount for point multiplication kP + Q

Method                             Points stored   # of M    # of I   Remark
Montgomery (traditional)           -               -         -        impossible
Montgomery (with recovery of y)    0               1492.8    0
Window Method (w = 4)              7               1456.8    4
Window Method (w = 5)              15              1459.8    5

5.3 ECDSA Scheme (Verification)

The signature verification of an elliptic curve digital signature algorithm (ECDSA)^5
needs to compute the operation $kP + lQ$, where $P$ is a fixed point and $Q$ is not
known a priori. We select the fixed-base comb method [LL94] with the width
$w = 4, 5$ + Montgomery’s method, and a simultaneous method [ElG85,HHM00]
with the width $w = 2$, and compare the computation amount for the size 160
bits.

To compute the fixed-base comb ($w = 4, 5$) + Montgomery method, the
following operations are needed:

(e) The scalar multiplication $kP$ by the fixed-base comb method.

(f) The scalar multiplication $lQ$ by Montgomery’s method.

(g) The addition $kP + lQ$.

(e) In the case of the width $w = 4$, the computation amount is
$445M + 338S$ by using the addition formulae of $J + A \to J^m$ for
the additions, the doubling formulae of $J^m \to J$ for the doublings ahead of
addition, and the doubling formulae of $J^m \to J^m$ for the doublings ahead of
doubling, where $J^m$ denotes the modified Jacobian coordinates. In the case of the
width $w = 5$, the computation amount is $362M + 273S$.

(f) Since the point $Q$ is on a Weierstrass-form elliptic curve, we transform this
into a point on a Montgomery-form elliptic curve. This costs $2M$. Then, we
compute the scalar-multiplied point $lQ$ by Montgomery’s method with recovery
of the y-coordinate. The computation amount is $969M + 639S$. Then, we
transform the point into a point on the Weierstrass-form elliptic curve. The
computation amount is $2M$.

(g) For the sake of fast computation of $kP + lQ$, we transform the point $lQ$
in projective coordinates into a point in Chudnovsky Jacobian coordinates.
This costs $3M + S$. The computation amount $11M + 3S$ is needed for the
addition $kP + lQ$.

Thus, to compute $kP + lQ$ by the fixed-base comb ($w = 4$) + Montgomery method,
the computation amount is $2216.8M$, assuming $(S/M) = 0.8$. The number of
points stored is 14. In the case of the fixed-base comb ($w = 5$) + Montgomery
method, the computation amount is $2081.8M$. The number of points stored is 30.

In the case of the simultaneous method with the width $w = 2$, the computation
of the points stored needs the computation amount $49M + 12S + 2I$ by using
the Montgomery trick [Coh93]. The computation of the multi-scalar multiplication
needs the computation amount $1223M + 1001S$ by using the addition formulae
of $J + A \to J^m$ for the additions, the doubling formulae of $J^m \to J$ for the
doublings ahead of addition and the doubling formulae of $J^m \to J^m$ for the
doublings ahead of doubling. Thus, the computation amount $2082.4M + 2I$ is
needed. The number of points stored is 13.
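
As a worked check of these totals, the step costs from items (e) to (g) and from the simultaneous method can be summed directly, using $S = 0.8M$. Here $973M = 2M + 969M + 2M$ collects the three parts of step (f), and $14M + 4S$ collects the two parts of step (g).

```latex
\begin{align*}
\text{comb } (w = 4) + \text{Montgomery}: &\quad (445 + 973 + 14)M + (338 + 639 + 4)S = 1432M + 981S = 2216.8M \\
\text{comb } (w = 5) + \text{Montgomery}: &\quad (362 + 973 + 14)M + (273 + 639 + 4)S = 1349M + 916S = 2081.8M \\
\text{simultaneous } (w = 2): &\quad (49 + 1223)M + (12 + 1001)S + 2I = 1272M + 1013S + 2I = 2082.4M + 2I
\end{align*}
```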

^5 For further details about ECDSA, see [ANSI] for example.

Memory Constrained. As regards the situation in which memory is
constrained, we examine methods in which the number of points stored is within
10.

In the case of fixed-base comb ($w = 3$) + Montgomery, the computation
amount is $2420.0M$, and the number of points stored is 6. In the case of the
width $w = 2$, the computation amount is $2762.0M$, and the number of points
stored is 2. In the case of Montgomery + Montgomery, the computation amount
is $2980.0M$, and no extra points are needed. In the case of simultaneous ($w = 1$),
the computation amount is $2585.6M + I$, and the number of points stored is 1.



Table 4. Computation amount for point multiplication kP + lQ, P fixed

Method                                   Points stored   # of M    # of I
Fixed-base comb (w = 4) + Montgomery     14              2216.8    0
Fixed-base comb (w = 5) + Montgomery     30              2081.8    0
Simultaneous (w = 2)                     13              2082.4    2


Table 5. Computation amount for point multiplication kP + lQ, P fixed, when memory
is constrained

Method                                   Points stored   # of M    # of I
Fixed-base comb (w = 2) + Montgomery     2               2762.0    0
Fixed-base comb (w = 3) + Montgomery     6               2420.0    0
Montgomery + Montgomery                  0               2980.0    0
Simultaneous (w = 1)                     1               2585.6    1


5.4 ECSVDP-MQV 

The elliptic curve secret value derivation primitive, Menezes-Qu-Vanstone
version (ECSVDP-MQV)^6 needs to compute the operation $k(P + lQ)$, where $l$ is
half the size of $k$, and points $P$, $Q$ are not known a priori.

To compute $k(P + lQ)$ by Montgomery’s method (Montgomery * Montgomery),
the following operations are needed:

(h) The scalar multiplication $lQ$ by Montgomery’s method.

(i) The addition $P + lQ$.

(j) The scalar multiplication $k(P + lQ)$ by Montgomery’s method.

(h) Since $l$ is half the size of $k$, the computation amount is
$489M + 319S$, which includes the computation amount for recovering the
y-coordinate.

(i) The computation amount is $14M + 2S$ for the addition
on the Montgomery-form elliptic curve.

(j) The computation amount is $957M + 638S$.

Thus, to compute $k(P + lQ)$ by Montgomery’s method, the computation amount
is $2227.2M$.

In the case of the simultaneous method with the width $w = 2$, computing the
points stored needs $63M + 15S + 3I$, and computing $kP + klQ$ by the simultaneous
method needs $1223M + 1001S$. Thus, the computation amount $2098.8M + 3I$ is
needed.

^6 For further details about ECSVDP-MQV, see [IEEEp1363] for example.



Table 6. Computation amount for point multiplication k(P + lQ), where l is half the
size of k

Method                     Points stored   # of M    # of I
Montgomery * Montgomery    0               2227.2    0
Simultaneous (w = 2)       13              2098.8    3

A Algorithms for Recovering y-Coordinate (Affine Version)

Algorithm 3: Algorithm for recovering the y-coordinate

INPUT $x, y, X_1, Z_1, X_2, Z_2$
OUTPUT $x_1, y_1$

 1. $T_1 \leftarrow x \times Z_1$           11. $T_1 \leftarrow T_1 \times Z_1$
 2. $T_2 \leftarrow X_1 + T_1$              12. $T_2 \leftarrow T_2 - T_1$
 3. $T_3 \leftarrow X_1 - T_1$              13. $T_2 \leftarrow T_2 \times Z_2$
 4. $T_3 \leftarrow T_3 \times T_3$         14. $T_2 \leftarrow T_2 - T_3$
 5. $T_3 \leftarrow T_3 \times X_2$         15. $T_1 \leftarrow 2B \times y$
 6. $T_1 \leftarrow 2A \times Z_1$          16. $T_1 \leftarrow T_1 \times Z_1$
 7. $T_2 \leftarrow T_2 + T_1$              17. $T_1 \leftarrow T_1 \times Z_2$
 8. $T_4 \leftarrow x \times X_1$           18. $T_3 \leftarrow T_1 \times Z_1$
 9. $T_4 \leftarrow T_4 + Z_1$              19. $T_3 \leftarrow 1/T_3$
10. $T_2 \leftarrow T_2 \times T_4$         20. $y_1 \leftarrow T_2 \times T_3$
                                            21. $T_1 \leftarrow T_1 \times X_1$
                                            22. $x_1 \leftarrow T_1 \times T_3$

The computation amount of Algorithm 3 is $14M + S + I$.
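
The register sequence of Algorithm 3 translates line by line into code. The sketch below mirrors the 22 steps over a toy prime field and compares the output with ordinary affine point addition; p = 101, A = 3, B = 1, the scaling factors and the helper names are illustrative assumptions, not values from the paper.

```python
# Sketch only: toy parameters chosen for illustration, not taken from the paper.
p, A, B = 101, 3, 1

def algorithm3(x, y, X1, Z1, X2, Z2):
    """Recover affine (x1, y1) of P1 from P = (x, y), x(P1) = X1/Z1, x(P1+P) = X2/Z2."""
    T1 = x * Z1 % p             # step 1
    T2 = (X1 + T1) % p          # step 2
    T3 = (X1 - T1) % p          # step 3
    T3 = T3 * T3 % p            # step 4 (the single squaring)
    T3 = T3 * X2 % p            # step 5
    T1 = 2 * A * Z1 % p         # step 6 (2A treated as a constant)
    T2 = (T2 + T1) % p          # step 7
    T4 = x * X1 % p             # step 8
    T4 = (T4 + Z1) % p          # step 9
    T2 = T2 * T4 % p            # step 10
    T1 = T1 * Z1 % p            # step 11
    T2 = (T2 - T1) % p          # step 12
    T2 = T2 * Z2 % p            # step 13
    T2 = (T2 - T3) % p          # step 14
    T1 = 2 * B * y % p          # step 15 (2B treated as a constant)
    T1 = T1 * Z1 % p            # step 16
    T1 = T1 * Z2 % p            # step 17
    T3 = T1 * Z1 % p            # step 18
    T3 = pow(T3, p - 2, p)      # step 19 (the single inversion)
    y1 = T2 * T3 % p            # step 20
    T1 = T1 * X1 % p            # step 21
    x1 = T1 * T3 % p            # step 22
    return x1, y1

def x_of_sum(P1, P):
    """x-coordinate of P1 + P by the affine group law (distinct x assumed)."""
    (x1, y1), (x, y) = P1, P
    lam = (y1 - y) * pow((x1 - x) % p, p - 2, p) % p
    return (B * lam * lam - A - x1 - x) % p

def check_all():
    pts = [(x, y) for x in range(p) for y in range(1, p)
           if (B * y * y - (x ** 3 + A * x * x + x)) % p == 0]
    checked = ok = 0
    for P in pts:
        for P1 in pts:
            if P1[0] == P[0]:
                continue
            Z1, Z2 = 5, 9                          # arbitrary nonzero scalings
            x2 = x_of_sum(P1, P)
            got = algorithm3(P[0], P[1], P1[0] * Z1 % p, Z1, x2 * Z2 % p, Z2)
            checked += 1
            ok += got == P1
    return checked, ok
```

Counting the `*` operations between field elements in `algorithm3` gives fourteen multiplications, one squaring and one inversion, matching the stated cost.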

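Algorithm 4: Algorithm for recovering the y-coordinate

INPUT $x, y, X_d, Z_d, X_{d+1}, Z_{d+1}$
OUTPUT $x_d, y_d$

 1. $T_1 \leftarrow X_d \times x$               11. $T_2 \leftarrow T_2 \times Z_d$
 2. $T_1 \leftarrow T_1 - Z_d$                  12. $T_4 \leftarrow T_2 \times X_d$
 3. $T_2 \leftarrow Z_d \times x$               13. $T_2 \leftarrow T_2 \times Z_d$
 4. $T_2 \leftarrow X_d - T_2$                  14. $T_2 \leftarrow 1/T_2$
 5. $T_3 \leftarrow X_{d+1} \times T_2$         15. $x_d \leftarrow T_2 \times T_4$
 6. $T_2 \leftarrow T_2 \times T_2$             16. $T_4 \leftarrow T_1 \times Z_{d+1}$
 7. $T_2 \leftarrow T_2 \times X_{d+1}$         17. $T_1 \leftarrow T_1 \times T_1$
 8. $T_2 \leftarrow T_2 \times Z_{d+1}$         18. $T_2 \leftarrow T_1 \times T_2$
 9. $T_2 \leftarrow T_2 \times y$               19. $T_1 \leftarrow T_3 + T_4$
10. $T_2 \leftarrow T_2 \times B$               20. $T_3 \leftarrow T_3 - T_4$
                                                21. $T_1 \leftarrow T_1 \times T_3$
                                                22. $y_d \leftarrow T_1 \times T_2$

The computation amount of Algorithm 4 is $15M + 2S + I$.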

B Proof of the Propositions

Proof (of Theorem 2). Since $P_2 = P_1 + P$, the x-coordinate $x_2$ is computed as
follows.

$$x_2 = B\left(\frac{y_1 - y}{x_1 - x}\right)^2 - A - x_1 - x.$$

Since $P_1$ and $P$ are on the Montgomery-form elliptic curve, it follows that
$By_1^2 = x_1^3 + Ax_1^2 + x_1$ and $By^2 = x^3 + Ax^2 + x$. We find the following equation
with an easy calculation from the equation above.

$$x_2 = \frac{x_1 + x - 2By_1y + 2Ax_1x + x_1x^2 + xx_1^2}{(x_1 - x)^2}.$$

The result follows from this equation. □

Proof (of Theorem 3). Since $P_2 = P_1 + P$, $P_3 = P_1 - P$ and $-P = (x, -y)$, the
x-coordinates $x_2$ and $x_3$ are computed as follows.

$$x_2 = B\left(\frac{y_1 - y}{x_1 - x}\right)^2 - A - x_1 - x \qquad (1)$$
$$x_3 = B\left(\frac{y_1 + y}{x_1 - x}\right)^2 - A - x_1 - x \qquad (2)$$

Multiplying equations (1) and (2) by $(x_1 - x)^2$, we obtain the following
equations.

$$B(y_1 - y)^2 = (x_2 + x_1 + x + A)(x_1 - x)^2 \qquad (3)$$
$$B(y_1 + y)^2 = (x_3 + x_1 + x + A)(x_1 - x)^2 \qquad (4)$$

Subtracting equation (3) from equation (4), we obtain the following equation.

$$4By_1y = (x_3 - x_2)(x_1 - x)^2 \qquad (5)$$

The result follows from this equation. □

Proof (of Proposition 1). For simplicity, we set

$$\alpha = [(X_m - Z_m)(X_n + Z_n) + (X_m + Z_m)(X_n - Z_n)],$$
$$\beta = [(X_m - Z_m)(X_n + Z_n) - (X_m + Z_m)(X_n - Z_n)].$$

Then we have

$$\frac{X'}{Z'} = \frac{Z_{m+n}\alpha^2}{X_{m+n}\beta^2} = \frac{X_{m-n}\beta^2\alpha^2}{Z_{m-n}\alpha^2\beta^2} \quad \text{(because of addition formulae)} \quad = x_{m-n}.$$

□

Abstract. A variation of the Complex Multiplication (CM) method for
generating elliptic curves of known order over finite fields is proposed. 
We give heuristics and timing statistics in the mildly restricted setting 
of prime curve order. These may be seen to corroborate earlier work of 
Koblitz in the class number one setting. Our heuristics are based upon a 
recent conjecture by R. Gross and J. Smith on numbers of twin primes 
in algebraic number fields. 

Our variation precalculates class polynomials as a separate off-line pro- 
cess. Unlike the standard approach, which begins with a prime p and 
searches for an appropriate discriminant D, we choose a discriminant 
and then search for appropriate primes. Our on-line process is quick and 
can be compactly coded. 

In practice, elliptic curves with near prime order are used. Thus, our tim- 
ing estimates and data can be regarded as upper estimates for practical 
purposes. 



1 Introduction 

An important category of cryptographic algorithms is that of the elliptic curve 
cryptosystems defined over a finite field Fp, see [9] for a recent overview. While 
there are many methods proposed for performing fast elliptic curve arithmetic, 
there is a paucity of efficient means for generating suitable elliptic curves. The 
methods proposed to date for curve generation mainly necessitate implementing 
complex and floating point arithmetic with high precision. However, this hinders 
the implementation of the proposed algorithms on simple processors with limited 
amounts of memory. In [13], Miyaji proposed a practical approach to construct
“anomalous” elliptic curves; these elliptic curves, of order p over fields of
characteristic p, have since been shown to be insecure [14], [16], [19]. However, the
idea of the construction can be applied to quickly find non-anomalous curves
as well. We present such a variant of the method to construct elliptic curves of
known prime orders. Our variant has less computational complexity in its online
implementation than that proposed in the IEEE standards [7]. Heuristics and
calculations show that our method is practical.

patent applications for inventions described in this paper.

Ç.K. Koç, D. Naccache, and C. Paar (Eds.): CHES 2001, LNCS 2162, pp. 142-158, 2001.
© Springer-Verlag Berlin Heidelberg 2001

Timing estimates for the Complex Multiplication (CM) method of generating 
elliptic curves seem difficult to find in the public literature. The above mentioned 
survey article [9] mentions that in practice the method is fast, but cites a timing 
result for a single curve. Our timing statistics are averaged over 1000 curves per 
discriminant. As to previous theoretical bounds on running times, it seems that 
Koblitz’s [8] conjectures and statistics for reduction of class number one CM 
curves defined over the rationals are taken to indicate that the CM method is in 
general speedy. We concur in Section 6.3.

We thank the referees for helpful comments and for pointing us to important 
entries in the literature. The second-named author thanks Professor G. Frey and 
his team at the Institut für Experimentelle Mathematik for clarifying some of
the basics of the theory of elliptic curves. 

The paper is organized as follows. Section 2 summarizes the complex mul- 
tiplication curve generation method. In section 3, we explain our variant which 
requires less data size and computation, while avoiding the weakness of Miyaji’s 
method. Section 4 summarizes the method to construct the class polynomials, 
the most computationally intensive part of the CM method. In our approach, 
we pre-calculate a set of these and store the coefficients. Also in section 4, we 
give some experimental results which indicate the efficiency of our approach. In 
section 5, we provide more detailed implementation results. In section 6, we give 
heuristics for the number of trials necessary to find prime order elliptic curves. 
Section 7 is a brief conclusion. 

2 Complex Multiplication Curve Generation Algorithm 

For the ease of the reader, we summarize some basics of the theory of elliptic 
curves. 

An elliptic curve $E$ defined over a finite field $\mathbb{F}_p$, where $p > 3$, can be given
as

$E(\mathbb{F}_p):\; y^2 = x^3 + ax + b, \quad a, b \in \mathbb{F}_p$    (1)

Associated with $E$, there are two important quantities:
the discriminant

$\Delta = -16(4a^3 + 27b^2)$    (2)

and the j-invariant

$j = -1728(4a)^3/\Delta$    (3)

where $\Delta \neq 0$.

Lemma 1. Given $j_0 \in \mathbb{F}_p$ there is an elliptic curve $E$ defined over $\mathbb{F}_p$ such
that $j(E) = j_0$.

An elliptic curve with a given j-invariant $j_0$ is constructed easily. We consider
$j_0 \notin \{0, 1728\}$; these special cases are also easily handled. Let
$k = j_0/(1728 - j_0)$, $j_0 \in \mathbb{F}_p$; then the equation

$E:\; y^2 = x^3 + 3kx + 2k$    (4)

gives an elliptic curve with j-invariant $j(E) = j_0$.
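The construction of Lemma 1 can be checked numerically. The following is a minimal Python sketch, not part of the original implementation; the function names `curve_from_j` and `j_invariant` are illustrative, and the j-invariant is evaluated in the equivalent form $1728 \cdot 4a^3/(4a^3 + 27b^2)$.

```python
def curve_from_j(j0: int, p: int) -> tuple[int, int]:
    """Return (a, b) such that y^2 = x^3 + a*x + b over F_p has j-invariant j0.

    Assumes p > 3 is prime and j0 is neither 0 nor 1728 modulo p.
    """
    j0 %= p
    if j0 == 0 or j0 == 1728 % p:
        raise ValueError("the special cases j0 = 0, 1728 are not handled here")
    k = j0 * pow((1728 - j0) % p, -1, p) % p   # k = j0 / (1728 - j0) mod p
    return 3 * k % p, 2 * k % p


def j_invariant(a: int, b: int, p: int) -> int:
    """j = 1728 * 4a^3 / (4a^3 + 27b^2) mod p."""
    four_a3 = 4 * pow(a, 3, p) % p
    den = (four_a3 + 27 * b * b) % p
    if den == 0:
        raise ValueError("singular curve (discriminant is zero)")
    return 1728 * four_a3 * pow(den, -1, p) % p


# Example: p = 41, j0 = 10 gives k = 18 and the curve y^2 = x^3 + 13x + 36.
a, b = curve_from_j(10, 41)
print(a, b, j_invariant(a, b, 41))
```

Modular inverses use the three-argument `pow` with exponent -1, which requires Python 3.8 or later.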

Theorem 1. Isomorphic elliptic curves have the same j-invariant. 

Theorem 2. (Hasse) Let $\#E(\mathbb{F}_p)$ denote the number of points on the elliptic
curve $E(\mathbb{F}_p)$. If $\#E(\mathbb{F}_p) = p + 1 - t$, then $|t| \le 2\sqrt{p}$.

Definition 1. (Twist) Given $E:\; y^2 = x^3 + ax + b$ with $a, b \in \mathbb{F}_p$, the twist of $E$
by $c$ is the elliptic curve given by

$E_c:\; y^2 = x^3 + ac^2x + bc^3$    (5)

where $c \in \mathbb{F}_p$.

Theorem 3. Let $E$ be defined over $\mathbb{F}_p$ and its order be $\#E(\mathbb{F}_p) = p + 1 - t$.
Then the order of its twist is given as

$\#E_c(\mathbb{F}_p) = \begin{cases} p + 1 - t & \text{if } c \text{ is a square in } \mathbb{F}_p \\ p + 1 + t & \text{if } c \text{ is a non-square in } \mathbb{F}_p \end{cases}$    (6)
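Theorems 2 and 3 can be observed on a toy example by brute-force point counting. The sketch below is illustrative only: the prime 41 and the curve $y^2 = x^3 + 13x + 36$ are small values chosen for this example, and the helper names are hypothetical. Nonzero $c$ is assumed, since $c = 0$ gives a singular equation.

```python
def count_points(a: int, b: int, p: int) -> int:
    """Number of points of y^2 = x^3 + ax + b over F_p, including infinity."""
    roots = [0] * p                      # roots[r] = number of y with y^2 = r
    for y in range(p):
        roots[y * y % p] += 1
    return 1 + sum(roots[(x * x * x + a * x + b) % p] for x in range(p))


def twist(a: int, b: int, c: int, p: int) -> tuple[int, int]:
    """Coefficients of the twist E_c of eq. (5)."""
    return a * c * c % p, b * pow(c, 3, p) % p


def is_square(c: int, p: int) -> bool:
    """Euler's criterion for nonzero c modulo an odd prime p."""
    return pow(c, (p - 1) // 2, p) == 1


p, a, b = 41, 13, 36
n = count_points(a, b, p)
t = p + 1 - n
print("order", n, "trace", t, "Hasse bound holds:", t * t <= 4 * p)
for c in (2, 3):                         # 2 is a square mod 41, 3 is not
    print(c, is_square(c, p), count_points(*twist(a, b, c, p), p))
```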



For the above basics of elliptic curves, we refer to [18]. The following result 
is based upon work of M. Deuring in the 1940s. See [1] and [10]. 

Theorem 4. (Atkin-Morain) Let $p$ be an odd prime such that

$4p = t^2 + Ds^2$    (7)

for some $t, s \in \mathbb{Z}$. Then there is an elliptic curve $E$ defined over $\mathbb{F}_p$ such that
$\#E(\mathbb{F}_p) = p + 1 - t$.

An integer $D$ which satisfies (7) for a given $p$ is called a CM discriminant of
$p$. Indeed, the curve $E$ has complex multiplication by the integers of $\mathbb{Q}(\sqrt{-D})$.
Given such a $D$ for a prime $p$, the j-invariant of the elliptic curve can be
calculated due to class field theory. Once the j-invariant is known, the elliptic curve
with $p + 1 - t$ points is easily constructed utilizing Lemma 1. Actually, the method
gives an elliptic curve with either $p + 1 - t$ or $p + 1 + t$ points. If the constructed
elliptic curve has $p + 1 + t$ points, then one must take the twist of this elliptic
curve to obtain an elliptic curve with $p + 1 - t$ points. Fortunately, it is trivial
to construct the desired curve when its twist is known, due to Theorem 3. This
technique for constructing elliptic curves of known order is called the Complex
Multiplication (CM) method.

A detailed explanation of the CM method is given in the P1363 standards.
One can also profitably refer to [2]. We summarize the method in the following:

1. Given a prime number $p$, find the smallest $D$ in (7) along with $t$ ($s$ is not
needed in the computations).

2. The orders of the curves which can be constructed are $\#E(\mathbb{F}_p) = p + 1 \pm t$.
Check if one of the orders has an admissible factorization (by admissible
factorization we mean a prime or nearly prime number as defined in the
standards). If not, find another $D$ and corresponding $t$. Repeat until an
order with admissible factorization is found.

3. Construct the class polynomial $H_D(x)$ using the formulas given in the
standards. (The class polynomial for a $D$ is a fixed monic polynomial with integer
coefficients. In particular, it is independent of $p$.)

4. Find a root $j_0$ of $H_D(x) \pmod p$. This $j_0$ is the j-invariant of the curve to
be constructed.

5. Set $k = j_0/(1728 - j_0) \pmod p$ and the curve will be $E:\; y^2 = x^3 + 3kx + 2k$.

6. Check the order of the curve. If it is not $p + 1 - t$, then construct the twist
using a randomly selected nonsquare $c \in \mathbb{F}_p$.
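Step 1 can be illustrated by a naive search. The following Python sketch is an assumption-laden illustration, not the procedure of the standards: it simply tries each candidate $D$ from a caller-supplied list (here a few discriminants that appear in the tables of this paper) and each $s$ until $4p - Ds^2$ is a perfect square. The function name is hypothetical.

```python
from math import isqrt


def find_cm_discriminant(p: int, candidates: list[int]):
    """Return the smallest D in candidates with 4p = t^2 + D*s^2, plus (t, s).

    Exhaustive search over s; only sensible for small p or small candidate lists.
    Returns None when no candidate is a CM discriminant of p.
    """
    for D in sorted(candidates):
        s = 1
        while D * s * s <= 4 * p:
            t_squared = 4 * p - D * s * s
            t = isqrt(t_squared)
            if t * t == t_squared:
                return D, t, s
            s += 1
    return None


# 4*41 = 11^2 + 43*1^2, so D = 43 is the smallest candidate that works for p = 41.
print(find_cm_discriminant(41, [11, 19, 43, 67, 163]))
```

A production implementation would instead solve (7) with Cornacchia's algorithm, which avoids the linear search over $s$.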

With the CM method, one may first fix a prime number $p$, and thereafter
construct an elliptic curve over $\mathbb{F}_p$. This has the possible advantage of allowing
the use of prime numbers of special forms, possibly permitting an improvement in
efficiency of the underlying modular arithmetic for the curve operations. On the
other hand, the method is efficient only when the degree of the class polyno- 
mial is small; in general, factoring a high degree polynomial is time consuming. 
Furthermore, the construction of the class polynomials requires multi-precision 
floating-point and complex number arithmetic. 



3 A Variant of the CM Method 

The variant is straightforward: Construct and store the class polynomials for all
$D$ in a fixed set $\mathcal{D}$ of discriminants, and search for primes whose CM
discriminants are in this set.
We thus avoid repeatedly calculating class polynomials; hence multi-precision
floating-point and complex number arithmetic as well as the factorization of high
degree class polynomials is avoided. Indeed, the original CM method as specified
in the standards becomes inefficient if not impractical as the class polynomial
degree becomes large.

Our algorithm is thus:

1. Off-line: Determine a set $\mathcal{D}$ of CM discriminants such that the corresponding
class numbers are small.

2. Off-line: Calculate and store the class polynomials of CM discriminants in
$\mathcal{D}$.

3. Select randomly a CM discriminant $D$ in $\mathcal{D}$ and obtain the corresponding
class polynomial $H_D(x)$.

4. Search for a prime number $p$ satisfying the equation $4p = t^2 + Ds^2$. (First, we
select random $t$ and $s$ values of appropriate sizes and then determine if $p$ is
prime.)

5. Compute $u_1 = p + 1 - t$ and $u_2 = p + 1 + t$, the orders of the candidate elliptic
curves, and determine if either of them has an admissible factorization (i.e.
is a prime or nearly-prime number). If not, go to Step 4 and pick another
random pair of $t$ and $s$.

6. If $u_1$ has an admissible factorization set $u = u_1$, otherwise $u = u_2$.

7. Find a root $j_0$ of $H_D(x) \bmod p$ (this is the j-invariant of the curve).

8. Set $k = j_0/(1728 - j_0) \bmod p$ and the curve of order $u_1$ or $u_2$ will be

$E_c:\; y^2 = x^3 + ax + b$    (8)

where $a = 3kc^2$, $b = 2kc^3$ and $c \in \mathbb{F}_p$ is randomly chosen.

9. Check the order of the curve. If it is $u$ then stop. Otherwise, select a
non-square number $e \in \mathbb{F}_p$ and calculate the twist by $e$,
$E_e(\mathbb{F}_p):\; y^2 = x^3 + ae^2x + be^3$.

Our experiments and heuristics confirm that pairs $p$ and $u$ of the type sought
can be found quickly.
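Steps 4 to 9 can be traced on toy parameters. The sketch below assumes $D = 163$, whose class polynomial has the single integer root $-640320^3$, and uses very small odd $t$ and $s$ so that the order can be checked by brute-force counting. The function names are illustrative, the random factor $c$ of step 8 is fixed to 1, and the admissibility test of steps 5 and 6 is omitted.

```python
J_ROOT_163 = -(640320 ** 3)   # root of the degree-one class polynomial for D = 163


def is_prime(n: int) -> bool:
    """Trial division; adequate for the tiny primes used in this sketch."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def count_points(a: int, b: int, p: int) -> int:
    """Brute-force order of y^2 = x^3 + ax + b over F_p (with infinity)."""
    roots = [0] * p
    for y in range(p):
        roots[y * y % p] += 1
    return 1 + sum(roots[(x * x * x + a * x + b) % p] for x in range(p))


def cm_curve_163(t: int, s: int):
    """Return (p, a, b) with #E(F_p) = p + 1 - t, or None if p is not prime."""
    p, rem = divmod(t * t + 163 * s * s, 4)          # step 4
    if rem or not is_prime(p):
        return None
    j0 = J_ROOT_163 % p                              # step 7 (degree one: no root finding)
    k = j0 * pow((1728 - j0) % p, -1, p) % p         # step 8 with c = 1
    a, b = 3 * k % p, 2 * k % p
    if count_points(a, b, p) != p + 1 - t:           # step 9: twist by a non-square e
        e = next(e for e in range(2, p) if pow(e, (p - 1) // 2, p) == p - 1)
        a, b = a * e * e % p, b * pow(e, 3, p) % p
    return p, a, b


print(cm_curve_163(3, 1))   # a curve over F_43 with 43 + 1 - 3 = 41 points
```

For $t = 1$ the target order equals $p$, which is the anomalous case that a real implementation has to reject.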

As stated in the introduction, the above is a generalizing variation of Miyaji’s 
simplification of the general CM method. Recently, A.K. Lenstra [11] has also 
suggested using restricted sets of discriminants. But, like Miyaji, Lenstra
considers only the class number one candidate discriminants.

4 Constructing Class Polynomials 

Although there are different methods to calculate class polynomials, we adopt
that of [1], see also [4]. Let $D = b^2 - 4ac$ be the discriminant of a quadratic form

$f(x, y) = ax^2 + bxy + cy^2$

where $a, b, c$ are integers. The quadratic form $f(x, y)$ is commonly represented
by the compact notation $[a, b, c]$. If the integers $a, b, c$ have no common factor,
then the quadratic form $[a, b, c]$ is called primitive. There are infinitely many
quadratic forms of any possible discriminant. We reduce to a finite number by
demanding that a root of $f(x, 1)$ lie in a certain region of the complex plane.
Let the primitive quadratic form $[a, b, c]$ be of negative discriminant. Let $\tau$ be
the root of $f(x, 1)$ which lies in the upper half-plane:

$\tau = (-b + \sqrt{D})/2a$

The form $[a, b, c]$ is a reduced form if $\tau$ has complex norm greater than or equal to
1, and $\Re(\tau) \in [-1/2, 1/2]$. Given a discriminant $D < 0$, we can easily find all
of the reduced quadratic forms of discriminant $D$. We then compute the class
polynomial $H_D(x)$ which is the minimal polynomial of the $j(\tau)$. For each value
of $\tau$, the j-value (denoted $j_i$ below) is computed as follows:

$j(\tau) = (256 f(\tau) + 1)^3 / f(\tau)$

where

$f(\tau) = \Delta(2\tau)/\Delta(\tau)$

$\Delta(\tau) = q \cdot \Big[1 + \sum_{n \ge 1} (-1)^n \big(q^{n(3n+1)/2} + q^{n(3n-1)/2}\big)\Big]^{24}$

and

$q = e^{2\pi i \tau}$

Finally, the class polynomial can be constructed by using the following formula:

$H_D(x) = \prod_{i=1}^{h} (x - j_i)$

where $h$ is the number of the reduced forms of $D$, commonly known as the
class number of $D$. Since $H_D(x)$ has integer coefficients one must use sufficient
accuracy during the computations.
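The enumeration of reduced forms is elementary and can be sketched as follows. This Python fragment is an illustration under stated assumptions: it takes the positive integer $D$ used elsewhere in the paper and enumerates primitive forms of discriminant $-D$. The conditions $|\tau| \ge 1$ and $\Re(\tau) \in [-1/2, 1/2]$ translate to $c \ge a$ and $|b| \le a$; on the boundary only $b \ge 0$ is kept, which is the usual convention for avoiding duplicates. The function name is hypothetical.

```python
from math import gcd


def reduced_forms(D: int) -> list[tuple[int, int, int]]:
    """Primitive reduced forms [a, b, c] with b^2 - 4ac = -D, for D > 0."""
    if D <= 0 or D % 4 not in (0, 3):
        raise ValueError("-D must be congruent to 0 or 1 modulo 4")
    forms = []
    a = 1
    while 3 * a * a <= D:                 # reduced forms satisfy a <= sqrt(D/3)
        for b in range(-a + 1, a + 1):    # -a < b <= a
            numerator = b * b + D
            if numerator % (4 * a):
                continue
            c = numerator // (4 * a)
            if c < a or (c == a and b < 0):
                continue
            if gcd(gcd(a, b), c) == 1:
                forms.append((a, b, c))
        a += 1
    return forms


for D in (163, 403, 883):
    forms = reduced_forms(D)
    print(D, len(forms), forms)           # the class number is len(forms)
```

The class numbers obtained for 163, 403 and 883 are 1, 2 and 3, matching the degrees of the class polynomials used in Section 5.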

Our approach, as stated earlier, is to construct class polynomials beforehand 
for given D values. We do this using some software tool specialized for mathe- 
matical calculations. In our implementation, we use Maple. Following [1], we set 
the precision for floating point arithmetic as follows: 

precision = 10 -I- ^ ' 




Here $N$ gives the number of terms to keep in the calculations involving the
various $\Delta(\tau)$.
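To see why precision matters, the j-value formulas of this section can be evaluated in ordinary double precision. The sketch below is not the Maple computation described above; it uses Python's `cmath`, a fixed small number of series terms (`terms=6`, an arbitrary choice for this illustration), and hypothetical function names. It takes the positive $D$ of the other sections, so the form $[a, b, c]$ has discriminant $-D$.

```python
import cmath
import math


def delta(q: complex, terms: int = 6) -> complex:
    """Truncated series for Delta as a function of q = exp(2*pi*i*tau)."""
    s = 1 + sum((-1) ** n * (q ** (n * (3 * n + 1) // 2) + q ** (n * (3 * n - 1) // 2))
                for n in range(1, terms + 1))
    return q * s ** 24


def j_value(a: int, b: int, D: int, terms: int = 6) -> complex:
    """j(tau) for the form [a, b, c] of discriminant -D, tau = (-b + sqrt(-D))/(2a)."""
    tau = complex(-b, math.sqrt(D)) / (2 * a)
    q = cmath.exp(2j * math.pi * tau)
    f = delta(q * q, terms) / delta(q, terms)     # Delta(2*tau) uses q^2
    return (256 * f + 1) ** 3 / f


# Class number one: the class polynomial is x - j, with j an integer.
print(j_value(1, 1, 11))    # close to -32768  = -32^3
print(j_value(1, 1, 19))    # close to -884736 = -96^3
```

Double precision carries roughly 16 significant digits, which suffices for $D = 11$ and $D = 19$ but not for $D = 163$, where the j-value has 18 digits; this is the reason a multi-precision tool is needed for the off-line step.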

As stated earlier, other methods than the basic use of the j-function applied 
here can be employed to construct class polynomials. In each of these, one ob- 
tains some class-invariant polynomial for the CM discriminant D. One advantage 
of using different methods is to obtain class polynomials with relatively small 
integer coefficients. This is particularly important when the processor used to 
store the polynomial coefficients is of limited memory. 



5 Implementation Results 

We implemented the algorithm using the NTL number theory and algebra package
[17] on a 450-MHz Pentium II based PC. We restricted to $t = 2v + 1$ and
$s = 2w + 1$ where $v, w \in \mathbb{Z}$. Thus, the prime numbers found in this setting are
of the form

$p = v^2 + v + (w^2 + w)D + \frac{D+1}{4}$    (9)

where $D$ satisfies

$D \equiv 3 \pmod 4$.

Furthermore, $D$ is chosen such that $(D+1)/4$ is odd, hence $p$ is odd for any choice
of $v$ and $w$. Throughout, we avoid the imaginary quadratic field with exceptionally
many units: we exclude $D = 3$. We obtained average times to find the prime
$p$ and prime $u$ as well as to calculate the corresponding curve for the following
values of $D$. Again, were $u$ merely required to be a nearly prime number, the search
time for admissible pairs would decrease.
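The search for an admissible pair $(p, u)$ can be sketched in a few lines. This is not the NTL implementation; it is a hedged Python illustration in which the Miller-Rabin test, the way the sizes of $v$ and $w$ are chosen, and the function names are all assumptions of the example. The sampled $p$ has at most `bits` bits rather than exactly `bits` bits, and anomalous orders $u = p$ are skipped.

```python
import random


def is_probable_prime(n: int, rounds: int = 20) -> bool:
    """Miller-Rabin with random bases (probabilistic)."""
    if n < 2:
        return False
    for q in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % q == 0:
            return n == q
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        x = pow(random.randrange(2, n - 1), d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False
    return True


def find_pair(D: int, bits: int, rng: random.Random):
    """Return (p, t, s, u, trials) with 4p = t^2 + D*s^2, p and u both prime."""
    v_bits = bits // 2 - 1
    w_bits = v_bits - D.bit_length() // 2          # keeps D*s^2 near t^2 in size
    trials = 0
    while True:
        trials += 1
        v, w = rng.getrandbits(v_bits), rng.getrandbits(w_bits)
        t, s = 2 * v + 1, 2 * w + 1
        p = v * v + v + (w * w + w) * D + (D + 1) // 4      # eq. (9)
        if not is_probable_prime(p):
            continue
        for u in (p + 1 - t, p + 1 + t):
            if u != p and is_probable_prime(u):
                return p, t, s, u, trials


p, t, s, u, trials = find_pair(163, 192, random.Random(2001))
print(p.bit_length(), trials)
```

The value of `trials` counts pairs $(v, w)$ and so corresponds to the product $N_p \times N_u$ in the notation of the tables, for a single run rather than an average.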

For

$\mathcal{D} = \{163, 403, 883\}$,

the corresponding class polynomials are given in the following:

$H_{163}(x) = x + 640320^3$;

$H_{403}(x) = x^2 + 2452811389229331391979520000\,x
            - 108844203402491055833088000000$;

$H_{883}(x) = x^3 + 34903934341011819039224295011933392896000\,x^2
            - 151960111125245282033875619529124478976000000\,x
            + 167990285381627318187575520800123387904000000000$.

We obtained efficiency results for these three cases. When the class number 
is one, the class polynomial is of degree one; hence the root is obtained without 
any computation. In the two other cases, we must determine a root for each p of 
the quadratic or cubic polynomial, respectively. The results are given in Table 1. 



Table 1. Timings to build curves of known prime order.

    D    class no   bitsize   Average time (s)    Np    Nu
  163        1        192           1.22          23    11
  163        1        224           2.29          27    14
  403        2        192           1.57          30    14
  403        2        224           3.29          36    21
  883        3        192           1.63          30    14
  883        3        224           3.01          36    19




To find a root modulo $p$ of a class polynomial takes approximately a constant
time determined by the size of the modulus $p$ and the degree of the polynomial.
However, the time or the number of trials to find admissible pairs of $p$ and $u$
is of a more complicated nature. We have run our program repeatedly to build
1000 different curves with each value of $D$ in Table 1. In the table, $N_p$ indicates
the approximate number of random pairs of $v$ and $w$ to be tried before a prime
$p = v^2 + v + (w^2 + w)D + (D + 1)/4$ is found. Similarly, $N_u$ is the average
number of primes $p$ of the form (9) tried to obtain a prime $u$.

The method remains efficient for larger class numbers, as shown in Table 2
and Figure 1.

Table 2. Timings to build curves of prime order for large class numbers.

  bitsize       D    class no   Average time (s)    Np    Nu
    192       555        4           3.54           51    35
             1051        5           2.78           48    26
              451        6           5.70           86    57
              811        7           4.61           76    44
             1299        8           5.91           69    59
             1187        9           7.35           79    72
              611       10          12.53          126   128
             1283       11           9.42           99    92
             1235       12          10.62          107   104
             1451       13          11.08          106   108
             1211       14          14.22          124   142
             1259       15          15.61          132   154
             1379       16          13.54          135   131
             1091       17          17.46          159   168
             1691       18          15.35          136   146
             2099       19          14.64          128   139
             1739       20          17.45          150   166
            25259       72          23.20          140   160
            37571       95          24.90          152   157



Fig. 1. Performance of the method with increasing class numbers (plotted
series: Timings, Np, Nu).

Table 2 clearly indicates that the admissible pair search time increases with
the class number. Although this increase is not monotone (the timing for
class number 10 is much higher than those for class numbers 11, 12, and 13),
it is reasonable to claim that the time needed to find proper pairs is directly
proportional to the class number. This result is consistent with the theoretical
considerations in [12]; see the next section for specific comments. The dependence
of the construction process on the particular value of $D$ seems to account for the
deviation from simple monotonicity. Note also, just as the theoretical heuristics of
the next section suggest, that the time to find an admissible pair $(p, u)$ decreases
with the size of $D$. This can be observed in Table 3. See also Figures 2, 3,
4, 5, 6, 7.



Table 3. Timings for various class numbers.

                        bitsize 192                  bitsize 224
  class no     D    Avg. time (s)   Np    Nu     Avg. time (s)   Np    Nu
      1       11        9.10        95    94        16.20       109   113
              19        3.86        68    39         7.15        81    49
              43        2.30        46    23         4.19        55    28
              67        1.87        37    18         3.55        44    23
             163        1.22        23    11         2.29        27    14
      2       35       10.38       105   108        15.74       120   110
             123        3.49        57    35         5.93        64    40
             187        2.42        45    23         4.31        52    28
             235        2.09        40    20         3.98        48    26
             403        1.57        30    14         3.29        36    21
      3       59       11.37       121   118        21.17       141   128
              83       10.01       102   104        16.93       118   117
             107        7.90        92    82        14.33       106    99
             379        2.63        47    25         4.85        56    32
             883        1.63        30    14         3.01        36    19
      4      155        9.50        99    99        16.14       116   112
             195        6.46        88    66        11.90       105    82
             259        4.77        78    49         8.46        91    58
             355        3.76        64    37         6.87        77    46
             555        3.54        51    35         6.54        63    44
      5      179       11.54       113   119        20.65       140   142
             227        9.33       103    97        17.42       122   120
             347        7.64        83    79        12.64        98    86
             443        6.65        73    68        11.81        86    81
            1051        2.78        48    26         5.52        55    36


Another important implementation aspect is code size. While one implementation
[15] of the full CM method [7] requires 204KB on a PC with Windows
NT, our implementation with NTL requires only 164KB code space on the same
platform. In fact, the code space can be made much smaller when code is written
expressly for curve generation. For the sake of simplicity, we have written such
a program which treats only the class number one case. We found that only an
extra 10 KB of object code space is needed for curve generation routines (assuming
that the basic subroutines for arithmetic operations needed for elliptic
curve arithmetic are already available).

Fig. 2. Timings to build curves with increasing discriminants.

Fig. 3. Number of trials for p with increasing discriminants.

Fig. 4. Number of trials for u with increasing discriminants.

Fig. 5. Timings to build curves with increasing discriminants.

Fig. 6. Number of trials for p with increasing discriminants.

Fig. 7. Number of trials for u with increasing discriminants.

6 Heuristics: Twin Primes and Prime Order Elliptic Curves

6.1 Finding Primes 

The Prime Number Theorem states that for sufficiently large $M$, the number of
primes in $[2, M]$ is approximately $M/\ln M$. But, with $D$ as chosen, $4p = t^2 + s^2D$
expresses that $p$ is a norm of an element in the ring of integers of $\mathbb{Q}(\sqrt{-D})$. The
density of rational primes which are of this type is $1/(2h_D)$, where $h_D$ is the
class number of $\mathbb{Q}(\sqrt{-D})$. See [2,3,4]. We thus have that some $M/(2h_D \ln M)$
primes of size up to $M$ are of our type.

With $p < M$, each pair $(s, t) \in \mathbb{Z}^2$ gives an integral lattice point inside
the ellipse of equation $t^2 + s^2D = 4M$. Gauss, see for example [3], found an
asymptotic formula for the number of lattice points interior to an ellipse. Here,
this gives that the number of the lattice points $(s, t)$ with $s, t$ both positive is
$L(M) = \pi M/\sqrt{D} + O(\sqrt{M})$. Furthermore, our $p$ are odd, we work with odd $D$
and we desire the elliptic curve order $u = p + 1 \pm t$ to be prime, hence certainly
odd. We thus only consider $s$ and $t$ odd. We thus search through a possible
$L(M)/4$ distinct values of $t^2 + s^2D$ for $(s, t)$ interior to the ellipse.

We search for prime $p$ in specific ranges of the form $[S, 2S]$, and hence expect
to find a prime $p$ after a total number of trials of $(v, w)$ of some
$N_p := c\,(\pi h_D \ln S)/\sqrt{D}$, for some constant $c$. Our experimental data confirms this, see
Tables 1, 2, 3, where $S$ is chosen to match the bit sizes 192 and 224.
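The claimed shape of $N_p$ can be compared with the measured values. The following Python sketch divides the observed $N_p$ at bitsize 192 (taken from Tables 1 and 3) by $\pi h_D \ln S/\sqrt{D}$; the choice $S = 2^{191}$ is an assumption made for this illustration, and a different power of two changes the ratios by well under one percent.

```python
from math import log, pi, sqrt

# (D, class number h_D, observed N_p at bitsize 192), from Tables 1 and 3
OBSERVED = [
    (11, 1, 95), (19, 1, 68), (43, 1, 46), (67, 1, 37),
    (163, 1, 23), (403, 2, 30), (883, 3, 30),
]

S = 2 ** 191          # assumed lower end of the 192-bit range


def shape(D: int, h: int, S: int) -> float:
    """pi * h_D * ln(S) / sqrt(D), i.e. N_p without the constant c."""
    return pi * h * log(S) / sqrt(D)


ratios = [n_p / shape(D, h, S) for D, h, n_p in OBSERVED]
for (D, h, n_p), c in zip(OBSERVED, ratios):
    print(f"D = {D:4d}  h = {h}  N_p = {n_p:3d}  implied c = {c:.3f}")
```

The implied constant stays within a narrow band (roughly 0.70 to 0.76) across these discriminants, which is what the heuristic predicts.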

6.2 Prime Order Elliptic Curves and Twin Primes 

The order of the curve we seek is $u = p + 1 \pm t$; we ask for it to be prime. Now, $p$
of our form is the norm of the element $\mathcal{P} = (t + s\sqrt{-D})/2$; note that $t$ is the trace
of $\mathcal{P}$. The norms of $\mathcal{P} \pm 1$ are easily seen to be the two possibilities for $u$. Thus, we
are seeking twin pairs $(\mathcal{P}, \mathcal{P} \pm 1)$. Indeed, the theory of complex multiplication
ensures that associated to each pair of this form is an elliptic curve defined over
$\mathbb{F}_p$ where $p$ is the norm of $\mathcal{P}$ and whose exact number of points over this field
equals the norm of $\mathcal{P} \pm 1$.
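The norm identity behind this observation is a one-line computation: shifting the element by $\pm 1$ shifts $t$ by $\pm 2$, so the norm becomes $((t \pm 2)^2 + Ds^2)/4 = p + 1 \pm t$. A small Python check, with a hypothetical helper name and arbitrary sample values, is given below.

```python
def norm(t: int, s: int, D: int) -> int:
    """Norm of (t + s*sqrt(-D))/2; an integer when t, s are odd and D = 3 mod 4."""
    value, rem = divmod(t * t + D * s * s, 4)
    if rem:
        raise ValueError("(t + s*sqrt(-D))/2 is not an algebraic integer")
    return value


D, t, s = 163, 7, 1
p = norm(t, s, D)                         # 53
print(p, norm(t - 2, s, D), norm(t + 2, s, D))
```

Adding 1 to the element replaces $t$ by $t + 2$ and gives norm $p + 1 + t$; subtracting 1 gives $p + 1 - t$.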

Although it is not known if there are infinitely many twin prime (principal
ideal) pairs in any quadratic field, there are conjectures as to their numbers
within bounded regions. This is also the case for twin rational primes, for which
Hardy and Littlewood [6] conjectured that there are some
$C_2 \int_2^M 1/(\ln y)^2\,dy$ twin primes of size less than $M$, with
$C_2 = 2 \prod_{\text{odd primes } p} \big(1 - 1/(p - 1)^2\big)$. This
constant is approximately 1.32032. The integral $\int_2^M 1/(\ln y)^2\,dy$ is
$M/(\ln M)^2 \times \gamma(M)$, where

$\gamma(M)$ is $(1 + 2!/\ln M + 3!/(\ln M)^2 + \cdots + n!/(\ln M)^{n-1}) + O((\ln M)^{-n})$.

Recently, Gross and Smith [5] have stated general conjectures for the number
of twin primes in algebraic number fields. For $\mathbb{Q}(\sqrt{-D})$ with $D$ congruent to 3
modulo 8, their conjecture is that the number of twin primes of norm less than
$M$ should be

$P(D, M) = 2\sqrt{D}/(\pi h_D^2) \times \beta(D) \times \int_2^M 1/(\ln y)^2\,dy$,

with $\beta(D) = \prod_Q \big(1 - 1/(N(Q) - 1)^2\big)$ where $Q$ runs through the prime ideals
of $\mathbb{Q}(\sqrt{-D})$ and $N(Q)$ denotes the norm to $\mathbb{Z}$. We thus see that the number of
$(v, w)$ which lead to elliptic curves of prime order over a prime field $\mathbb{F}_p$ with $p$
of norm less than $M$ should be $2\sqrt{D}/(\pi h_D^2) \times M/(\ln M)^2 \times \beta(D) \times \gamma(M)$.

We bound $\beta(D)$ for $D$ congruent to 3 modulo 8 by considering (unachievably)
extremal splitting behavior of rational prime ideals $(p)$. Were every odd prime to
split as the product of two distinct primes in such a field, then
$\beta_{split} = 2/9 \times C_2^2 = 0.3874\ldots$ If all odd primes were to remain inert, one finds
$\beta_{inert} = 0.87299$.
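The constants quoted here can be reproduced numerically. The sketch below truncates the Euler products at $10^5$, an arbitrary cut-off for this illustration. It assumes, as holds for $D \equiv 3 \pmod 8$, that the prime 2 is inert and contributes the factor $1 - 1/(4 - 1)^2 = 8/9$; a split odd prime $\ell$ contributes $(1 - 1/(\ell - 1)^2)^2$ and an inert one $1 - 1/(\ell^2 - 1)^2$.

```python
def odd_primes_up_to(n: int) -> list[int]:
    """Sieve of Eratosthenes, returning the odd primes <= n."""
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(range(i * i, n + 1, i)))
    return [i for i in range(3, n + 1) if sieve[i]]


PRIMES = odd_primes_up_to(100_000)

c2 = 2.0
inert_product = 1.0
for ell in PRIMES:
    c2 *= 1 - 1 / (ell - 1) ** 2
    inert_product *= 1 - 1 / (ell * ell - 1) ** 2

beta_split = (2 / 9) * c2 ** 2           # (8/9) * (C_2 / 2)^2
beta_inert = (8 / 9) * inert_product

print(f"C_2        ~ {c2:.5f}")
print(f"beta_split ~ {beta_split:.4f}")
print(f"beta_inert ~ {beta_inert:.5f}")
```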

We conclude that the number of trials of pairs $(v, w)$ to find a prime pair $(p, u)$
with $p$ of norm in an interval $[S, 2S]$ should be $N_p \times N_u$, with $N_p \times N_u$
approximately a constant times $h_D^2 \ln^2 S / (\beta(D)\,D)$. Again, our data confirm this.
See in particular Figure 8.

Fig. 8. A comparison of theoretic and experimental values for $N_p \times N_u$
(plot title: Expected trial numbers for twin primes, Np * Nu).

6.3 Special Case: Class Number One

The reduction of an equation over the integers $\mathbb{Z}$ with respect to a prime number
$p$ is given by reducing each coefficient of the equation modulo $p$. This can be
extended to equations over the rational numbers; and indeed to equations over
algebraic number fields, where one reduces by prime ideals.

Koblitz [8] used the Hardy-Littlewood heuristics to derive conjectures on the
number of primes $p$ for which the reduction of an elliptic curve defined over $\mathbb{Q}$ is
an elliptic curve of prime order. In the class number one CM setting this number
should be asymptotic to a constant times $M/(\ln M)^2$; the constant is explicit.

In deriving his conjecture, Koblitz does not directly use twin primes in
$\mathbb{Q}(\sqrt{-D})$. It would be very interesting to relate his constant to the Gross-Smith
$\beta(D)$ in this restricted case of class number one. We briefly review why there
might well be such a relationship.

An elliptic curve of j-value $j_0 \pmod p$ found with the CM method is the
reduction of an elliptic curve defined over the complex numbers having j-value
the corresponding root of the class polynomial $H_D(x)$. The reduction is with
respect to a prime lying above $p$ in the algebraic number field in which the
root lies. In the class number one case, the single root of $H_D(x)$ is in $\mathbb{Z}$. The
corresponding elliptic curve is defined over $\mathbb{Q}$, and the CM method amounts to
reducing the equation of this curve modulo primes which split to principal ideals
in $\mathbb{Q}(\sqrt{-D})$. Thus, Conjecture B of [8] then predicts the number of primes up
to $M$ (up to choosing twists) that give prime order elliptic curves.

Table 4 gives a comparison between the Koblitz predicted value, the
Gross-Smith twin primes value, and actual counts of twin primes and of anomalous
primes. The anomalous values are primes naturally paired with themselves in
our construction. (These are not counted as acceptable values of $u$ in our timing
and counts for the various $N_u$.) Whereas the Gross-Smith formula should give
the number of twins, the Koblitz formula reasonably interpreted should give the
number of twins plus half the number of the anomalous curves.

7 Conclusion 

We present a variant of the complex multiplication (CM) elliptic curve generation
algorithm for $\mathbb{F}_p$. We show that the new variant of the CM method allows
off-line precalculation and therefore provides a smaller, faster and more easily coded
on-line software implementation. The theoretical analysis shows that there are
numerous prime numbers in this subset and experimental results confirm that
it is highly probable to construct a prime number belonging to this set with a
fairly small number of searches. Our experiments also reveal that the
on-line performance of the modified CM method increases as the class number
decreases. Another interesting result is that the new CM method performs better
for larger discriminants of the same class number.

Table 4. Twin primes: estimates and counts.

    D       M     Koblitz   Gross-Smith   Twins   Anomalous
   11     2000      10.9        12.1        12        4
          4000      17.9        19.2        20        4
          6000      24.1        25.5        23        5
          8000      30.1        31.3        26        5
         10000      35.7        36.7        33        5
   19     2000      24.2        25.9        23        5
          4000      37.9        41.1        36        7
          6000      51.2        54.5        51        7
          8000      63.1        66.9        63        7
         10000      75.2        78.6        78        9
   43     2000      41.7        46.1        45        4
          4000      67.1        73.2        72        5
          6000      89.2        97.0        88        5
          8000     111.1       119.0       105        6
         10000     131.5       139.9       122        7
   67     2000      54.8        59.2        56        4
          4000      88.2        93.9        91        6
          6000     117.2       124.5       125        7
          8000     144.8       152.7       157        7
         10000     172.4       179.4       189        8
  163     2000      76.6        94.3        72        4
          4000     128.9       149.6       127        5
          6000     180.0       198.3       183        6
          8000     225.4       243.3       234        6
         10000     265.4       285.8       272        6
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Abstract. Croptography is a relatively new area of research which uses 
optical techniques to solve cryptographic problems. Optical computations
are characterized by extremely high speed and truly massive parallelism,
but they cannot be used as general purpose computers. In this
talk I’ll survey the field, and show that many natural problems in cryptography
and cryptanalysis can be efficiently solved by simple optical
techniques. In particular, I’ll describe a new way to break LFSR-based
stream ciphers by using commercially available optical devices.

Abstract. In this paper a new low complexity parallel multiplier for
characteristic two finite fields $GF(2^m)$ is proposed. In particular our
multiplier works with field elements represented through both Canonical
Basis and Type I Optimal Normal Basis (ONB), provided that the irreducible
polynomial generating the field is an All One Polynomial (AOP).
The main advantage of the scheme is the resulting space complexity,
significantly lower than the one provided by the other fast parallel multipliers
currently available in the open literature and belonging to the same
class.



1 Introduction 

Finite fields have recently attracted a lot of attention due to the increasing
number of cryptography and coding theory applications that require high
performance finite field capabilities ([9]). Several new architectures have been
proposed in order to fulfill the constraints imposed by specific purposes ([2,8,10]).
Although different solutions can be compared from several points of view, time
complexity and space complexity are, usually, the two most important parameters.
The former is defined as the elapsed time between input and output of the
circuit implementing the multiplier, and it is usually expressed as a function of
the field degree $m$, the delay of an AND gate $T_A$ and the delay of an XOR gate
$T_X$. The latter, on the contrary, is defined as the pair of numbers $S_A$ and $S_X$,
of AND and XOR gates used respectively. Although a manifest improvement in
space complexity over the best known algorithm is still possible, because of an
achievable asymptotic space complexity given by $O(m \log_2 m \log_2 \log_2 m)$ ([1]),
these two parameters are characterized by an evident trade-off. In fact, reducing
the number of gates causes, in general, a corresponding increase in the execution
time. So, if performance is the most critical parameter, we can accept a greater
space complexity, in exchange for a reduction of the corresponding time delay.
Conversely, in other applications such as those based on smart cards, mobile
phones, or other portable devices, a reduced space complexity is often the most
important design aspect.

For these reasons we will focus on a special class of fast multipliers,
characterized by a generator of type AOP, which can take advantage of the
trade-off between time and space complexity to achieve a space complexity significantly
lower than those offered by the traditional bit-parallel multipliers of the same
class ([3, 4, 5, 6, 7]), with a small increase in the corresponding time delay. In other
words, a limited rise in the time complexity is accepted in order to obtain a more
substantial reduction in the corresponding circuit area.

The paper is organized as follows: section two introduces some useful
preliminaries; section three provides an architectural description of the
multiplier when the field elements are represented through a Canonical Basis, while
section four focuses on Type I ONB representations. The last section summarizes
the results obtained and draws some conclusions.

2 Preliminaries 

Characteristic two finite fields $GF(2^m)$ provide a plethora of methods to
represent field elements according to their particular application. Specifically, the
two most classical schemes reported in literature are Canonical Basis (also
called Standard Basis) and Optimal Normal Basis, though other strategies have
recently been proposed ([2]). The former represents the generic field element
$a \in GF(2^m)$ through the m-bit vector $(a_0, a_1, \ldots, a_{m-1})$ with respect to the set
$(1, \alpha, \alpha^2, \ldots, \alpha^{m-1})$, where $\alpha$ is the root of an irreducible polynomial of degree
$m$ over $GF(2)$, which corresponds to the expansion $a(\alpha) = \sum_{i=0}^{m-1} a_i \alpha^i$. On the
contrary, the latter specializes the set as $(\gamma, \gamma^2, \ldots, \gamma^{2^{m-1}})$, where $\gamma$ is now the
root of an N-polynomial of degree $m$ over $GF(2)$. In this case the expansion is
therefore given by $a(\gamma) = \sum_{i=0}^{m-1} a_i \gamma^{2^i}$ (for more information see [9]).

In order to reduce the complexity of the field multiplication special classes
of irreducible polynomials have been suggested ([7], [10]). Among them, the
AOP generators have been shown to be particularly interesting. An AOP is a
polynomial characterized by the form $p(x) = 1 + x + x^2 + \ldots + x^m$, which is
irreducible if and only if $m + 1$ is prime and 2 is primitive modulo $m + 1$ ([9]).
For instance, for $m \le 100$ there are thirteen useful values: 2, 4, 10, 12, 18, 28,
36, 52, 58, 60, 66, 82, and 100. Moreover, each N-polynomial generating a Type
I ONB is also an AOP ([9]). For this reason in the following we will focus on
AOPs, discussing the advantages of this class in the context of both Canonical
Basis and Type I ONB representations.
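The list of admissible degrees follows directly from the irreducibility criterion and can be regenerated with a short search. The Python sketch below is illustrative; the function names are hypothetical and the case $m = 1$ is left out, in line with the list above.

```python
def is_prime(n: int) -> bool:
    """Trial division, sufficient for the small moduli considered here."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def order_of_two(q: int) -> int:
    """Multiplicative order of 2 modulo an odd integer q > 1."""
    order, x = 1, 2 % q
    while x != 1:
        x = 2 * x % q
        order += 1
    return order


def aop_degrees(max_m: int) -> list[int]:
    """Degrees 2 <= m <= max_m with m + 1 prime and 2 primitive modulo m + 1."""
    return [m for m in range(2, max_m + 1)
            if is_prime(m + 1) and order_of_two(m + 1) == m]


print(aop_degrees(100))
```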

3 Canonical Basis 

Let $p(x) = 1 + x + x^2 + \ldots + x^m$ be an irreducible polynomial over $GF(2)$, and
let $a(\alpha)$ and $b(\alpha)$ be two elements of $GF(2^m)$, represented through the m-tuples
$(a_0, a_1, \ldots, a_{m-1})$ and $(b_0, b_1, \ldots, b_{m-1})$, with respect to the root $\alpha$ of $p(x)$.
Our goal is the computation of the field element $(c_0, c_1, \ldots, c_{m-1})$ given by the
product $c(\alpha) = a(\alpha) \cdot b(\alpha) \in GF(2^m)$. This product can be computed in two
different phases:

1. computation of the ordinary product of two polynomials $\ell(x) = a(x) \cdot b(x)$
over $GF(2)$

2. computation of the field product $c(x) \in GF(2^m)$ as $c(x) = \ell(x) \bmod p(x)$
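The two phases can be sketched in software, with polynomials over $GF(2)$ stored as Python integers (bit $i$ holds the coefficient of $x^i$). This is only a reference model for checking results, not the parallel circuit discussed in this paper; the function names and the choice $m = 4$ are assumptions of the example.

```python
def clmul(a: int, b: int) -> int:
    """Phase 1: carry-less product of two GF(2)[x] polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def poly_mod(d: int, p: int) -> int:
    """Phase 2: remainder of d(x) modulo p(x) over GF(2)."""
    deg_p = p.bit_length() - 1
    while d.bit_length() - 1 >= deg_p:
        d ^= p << (d.bit_length() - 1 - deg_p)
    return d


def aop(m: int) -> int:
    """The all-one polynomial 1 + x + ... + x^m as a bit mask."""
    return (1 << (m + 1)) - 1


def field_mul(a: int, b: int, m: int) -> int:
    """Product in GF(2^m) generated by the AOP of degree m (m must be admissible)."""
    return poly_mod(clmul(a, b), aop(m))


# m = 4: since (x + 1) p(x) = x^5 + 1, the root alpha satisfies alpha^5 = 1.
m, alpha, x = 4, 0b10, 1
for _ in range(5):
    x = field_mul(x, alpha, m)
print(x)
```

The identity $(x + 1)\,p(x) = x^{m+1} + 1$ is what makes reduction modulo an AOP cheap: powers of $\alpha$ wrap around with period $m + 1$.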
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3.1 Multiplication of Polynomials over GF(2) 

First, we observe that the degrees of the polynomials a(x) and b(x) are both 
< m — 1, therefore the degree of the polynomial £(x) will be, in turn, < 2m — 2. 
Formally we have: 

£(x) = a(x) ■ b{x) = £o+ £\x + £ 2 X^ + . . . + £ 2 m- 2 X^'^~'^ (1) 

This polynomial can be computed by means of a divide-and-conquer approach 
originally proposed to increase the speed of integer multiplications ([11])- Ac- 
tually this strategy, which we will slightly improve and extend respect to the 
results obtained in [8], in turn reminiscent of the Karatsuba-Ofman algorithm, 
has been also successfully applied in case of trinomial generators ([12]). 

More precisely, observe that in this context $m$ is necessarily even, thanks
to the conditions that make $p(x)$ irreducible. Therefore we can assume
$m = 2N$. As a consequence the polynomials $a(x)$ and $b(x)$ can be rewritten as
$a(x) = A(x) + x^N B(x)$ and $b(x) = C(x) + x^N D(x)$ respectively, where

$$A(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_{N-1} x^{N-1}$$

$$B(x) = a_N + a_{N+1} x + a_{N+2} x^2 + \ldots + a_{2N-1} x^{N-1}$$

and analogously

$$C(x) = b_0 + b_1 x + b_2 x^2 + \ldots + b_{N-1} x^{N-1}$$

$$D(x) = b_N + b_{N+1} x + b_{N+2} x^2 + \ldots + b_{2N-1} x^{N-1}$$

Therefore, the product $\ell(x)$ can be computed as

$$\ell(x) = a(x) \cdot b(x) = A(x)C(x) + x^N [B(x)C(x) + A(x)D(x)] + x^{2N} B(x)D(x) \qquad (2)$$

which, introducing the following auxiliary polynomials

$$P_{AC}(x) = A(x) \cdot C(x)$$

$$P_{BD}(x) = B(x) \cdot D(x)$$

$$P_{A+B}(x) = A(x) + B(x)$$

$$P_{C+D}(x) = C(x) + D(x)$$

we can also express as

$$\ell(x) = P_{AC}(x) + x^N [P_{A+B}(x) \cdot P_{C+D}(x) + P_{AC}(x) + P_{BD}(x)] + x^{2N} P_{BD}(x) \qquad (3)$$

Eq. (3) computes the product $a(x) \cdot b(x)$ by means of three multiplications
of polynomials of degree $N - 1$, together with shifts and “lettings-down” of
powers. Specifically, the architectural structure of the multiplier can be organized
as follows:

- two circuits, composed of $N$ XOR gates each, for the parallel computation
of $A(x) + B(x)$ and $C(x) + D(x)$

- three circuits, composed of $N^2$ AND and $(N - 1)^2$ XOR gates each, for the
parallel computation of $A(x) \cdot C(x)$, $B(x) \cdot D(x)$, and $[A(x) + B(x)] \cdot [C(x) +
D(x)]$; the XOR tree depth is $\lceil \log_2(N - 1) \rceil$, provided that the polynomials
involved have at most degree $N$

- one circuit, composed of $2N - 1$ XOR gates, for the computation of $A(x)C(x)
+ B(x)D(x)$

- one circuit, composed of $2N - 1$ XOR gates, for the computation of $[A(x)C(x)
+ B(x)D(x)] + [A(x) + B(x)][C(x) + D(x)]$

- one circuit, composed of $2N - 2$ XOR gates, for the computation of $\ell(x)$
by means of eq. (3), where each term, at this point, has already been
pre-computed
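
The three-multiplication split of eq. (3) can be illustrated in software. The following is a minimal sketch under stated assumptions, not the hardware architecture itself: polynomials over GF(2) are stored as Python integers (bit $i$ is the coefficient of $x^i$), the function names are hypothetical, and operands with an odd number of coefficients are simply split unevenly.

```python
def clmul(a, b):
    """Schoolbook (direct) product of two GF(2) polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # XOR is addition over GF(2)
        a <<= 1
        b >>= 1
    return r


def karatsuba_gf2(a, b, m):
    """Product of two polynomials with at most m coefficients, following eq. (3)."""
    if m <= 4:
        # Small operands: fall back to the direct multiplier.
        return clmul(a, b)
    n = (m + 1) // 2                    # size of the low halves A(x), C(x)
    mask = (1 << n) - 1
    A, B = a & mask, a >> n             # a(x) = A(x) + x^n B(x)
    C, D = b & mask, b >> n             # b(x) = C(x) + x^n D(x)
    p_ac = karatsuba_gf2(A, C, n)
    p_bd = karatsuba_gf2(B, D, n)
    p_mid = karatsuba_gf2(A ^ B, C ^ D, n)   # (A + B)(C + D)
    return p_ac ^ ((p_mid ^ p_ac ^ p_bd) << n) ^ (p_bd << (2 * n))
```

For example, `karatsuba_gf2(a, b, 10)` and `clmul(a, b)` agree for any two 10-bit operands; in hardware the three recursive calls correspond to three circuits working in parallel.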

As far as the time complexity is concerned, it should be noted that the
overall circuit is able to produce the output $\ell(x)$ with a time delay of
$T_A + T_X(\lceil \log_2(N - 1) \rceil + 3)$. In fact, after a period of time equal to $T_X$, the
intermediate values $A(x) + B(x)$ and $C(x) + D(x)$ will be available; therefore,
once $T_A + T_X(\lceil \log_2(N - 1) \rceil + 1)$ seconds have elapsed in total, the circuit will
also have computed $[A(x) + B(x)] \cdot [C(x) + D(x)]$, $B(x)D(x)$, $A(x)C(x)$ and
$A(x)C(x) + B(x)D(x)$, and after a further $T_X$ seconds the computation
of the term $A(x)C(x) + B(x)D(x) + [A(x) + B(x)] \cdot [C(x) + D(x)]$ will be completed.
The result $\ell(x)$, which needs a further $T_X$ seconds to be reached, therefore
requires a time complexity equal to $T_A + T_X(\lceil \log_2(N - 1) \rceil + 3)$.

The overall characteristics of the algorithm, whose details are presented
in Table 1, are respectively:

$$\Sigma'_A = 3N^2 = \tfrac{3}{4} m^2$$

$$\Sigma'_X = 3N^2 + 2N - 1 = \tfrac{3}{4} m^2 + m - 1 \qquad (4)$$

$$\Theta' = T_A + T_X(\lceil \log_2(N - 1) \rceil + 3) = T_A + T_X(\lceil \log_2(m - 2) \rceil + 2)$$

which can be compared with those provided by a direct parallel multiplication

$$\Sigma'_A = m^2$$

$$\Sigma'_X = m^2 - 2m + 1 \qquad (5)$$

$$\Theta' = T_A + T_X \lceil \log_2(m - 1) \rceil$$

It is evident how the former strategy exchanges a part of its time complexity
in order to gain a 3/4 factor in the corresponding number of gates.

The values in (4) can also be further manipulated and expressed as
(see also Table 1)

$$(\Sigma'_A)_m = 3 (\Sigma'_A)_{m/2} \qquad (6)$$

$$(\Sigma'_X)_m = 3 (\Sigma'_X)_{m/2} + 4m - 4 \qquad (7)$$

$$(\Theta')_m = (\Theta')_{m/2} + 3T_X \qquad (8)$$

where $(C)_d$ represents the complexity $C$ of the multiplier, i.e. $\Sigma'_A$, $\Sigma'_X$ and $\Theta'$,
when the polynomials in input have degree at most $d - 1$, that is $d$ coefficients.

Eqs. (6), (7), and (8) show that the product of two polynomials of degree
$\le m - 1$ can be performed by means of three multiplications of two polynomials
of degree equal (at most) to about half that of the original ones, plus a little overhead
needed to combine the partial results and to obtain the final output. Moreover,
these three multiplications can be computed in parallel, which is why the
factor 3, present in (6) and (7), does not appear in the time complexity (8).
It should also be pointed out that this additional
overhead is relatively small, being limited to $4m - 4$ XOR gates in (7) and
to an additional time delay of $3T_X$ in (8).

Table 1. Time and space complexity to multiply polynomials over GF(2).

| Operation | $\Sigma_A$ | $\Sigma_X$ | $\Theta$ | Register Size |
|---|---|---|---|---|
| $A(x) + B(x)$ | | $N$ | $T_X$ | $N$ |
| $C(x) + D(x)$ | | $N$ | $T_X$ | $N$ |
| $A(x)C(x)$ | $N^2$ | $(N-1)^2$ | $T_A + T_X \lceil \log_2(N-1) \rceil$ | $2N-1$ |
| $B(x)D(x)$ | $N^2$ | $(N-1)^2$ | $T_A + T_X \lceil \log_2(N-1) \rceil$ | $2N-1$ |
| $(A(x) + B(x))(C(x) + D(x))$ | $N^2$ | $(N-1)^2$ | $T_A + T_X \lceil \log_2(N-1) \rceil$ | $2N-1$ |
| $A(x)C(x) + B(x)D(x)$ | | $2N-1$ | | $2N-1$ |
| $A(x)C(x) + B(x)D(x) + (A(x) + B(x))(C(x) + D(x))$ | | $2N-1$ | $T_X$ | $2N-1$ |
| $\ell(x)$ | | $2N-2$ | $T_X$ | $4N-1$ |

Moreover, provided that $m/2$ is also even, this strategy can be applied
again, in order to gain a further reduction in the gate count. For instance,
assuming that $m$ is a power of 2, after $k$ iterations we will obtain:

$$(\Sigma'_A)_m = 3^k (\Sigma'_A)_{m/2^k}$$

$$(\Sigma'_X)_m = 3^k (\Sigma'_X)_{m/2^k} + 8m[(3/2)^k - 1] - 2(3^k - 1) \qquad (9)$$

$$(\Theta')_m = (\Theta')_{m/2^k} + 3k T_X$$

These results show a clear trade-off between time and space complexity.
To significantly reduce the number of gates we have to increase the
number of iterations, although, as a side effect, the time delay
of the multiplier will also rise, linearly in the number of iterations.
Of course, an interesting question is: how far can we iterate the algorithm,
provided that we want to reduce the space complexity as much as possible? It
is easy to see that the optimal stop condition for this recursion is $m/2^k = 4$, a
value for which a direct parallel multiplication is more advantageous than
the recursive scheme. In fact, iterating the algorithm we obtain $(\Sigma'_A)_4 = 12$ and
$(\Sigma'_X)_4 = 15$, from which $(\Sigma'_{TOT})_4 = (\Sigma'_A)_4 + (\Sigma'_X)_4 = 27$, while $(\Theta')_4 = T_A + 4T_X$.
On the contrary, using a direct strategy we have $(\Sigma'_A)_4 = 16$ and $(\Sigma'_X)_4 = 9$,
from which $(\Sigma'_{TOT})_4 = (\Sigma'_A)_4 + (\Sigma'_X)_4 = 25$, while $(\Theta')_4 = T_A + 2T_X$. Therefore,
taking into account this stop condition, the corresponding complexities, in case
of $m = 2^t$, will be:

$$(\Sigma'_A)_m = 16 \cdot 3^{\log_2 m - 2}$$

$$(\Sigma'_X)_m = 8m \cdot [(3/2)^{\log_2 m - 2} - 1] + 7 \cdot 3^{\log_2 m - 2} + 2 \qquad (10)$$

$$(\Theta')_m = T_A + T_X(3 \log_2 m - 4)$$

which slightly improves the results reached in [8]. For a quantitative comparison
see also Table 2, which shows that our scheme uses a greater number
of AND gates than [8], but in exchange reduces both the overall
number of gates $\Sigma_{TOT}$ and the time complexity $\Theta$.

Table 2. Comparing different polynomial multipliers over GF(2).

| Scheme | $\Sigma_A$ | $\Sigma_X$ | $\Sigma_{TOT}$ | $\Theta$ |
|---|---|---|---|---|
| Direct | $m^2$ | $m^2 - 2m + 1$ | $2m^2 - 2m + 1$ | $T_A + T_X \lceil \log_2(m-1) \rceil$ |
| m = 256 | 65,536 | 65,025 | 130,561 | $T_A + 8T_X$ |
| m = 1024 | 1,048,576 | 1,046,529 | 2,095,105 | $T_A + 10T_X$ |
| m = 2048 | 4,194,304 | 4,190,209 | 8,384,513 | $T_A + 11T_X$ |
| Paar ([8]) | $m^{\log_2 3}$ | $6m^{\log_2 3} - 8m + 2$ | $7m^{\log_2 3} - 8m + 2$ | $T_A + 3T_X \log_2 m$ |
| m = 256 | 6,561 | 37,320 | 43,881 | $T_A + 24T_X$ |
| m = 1024 | 59,049 | 346,104 | 405,153 | $T_A + 30T_X$ |
| m = 2048 | 177,147 | 1,046,500 | 1,223,647 | $T_A + 33T_X$ |
| Proposed | $16 \cdot 3^{\log_2 m - 2}$ | $8m[(3/2)^{\log_2 m - 2} - 1] + 7 \cdot 3^{\log_2 m - 2} + 2$ | $8m[(3/2)^{\log_2 m - 2} - 1] + 23 \cdot 3^{\log_2 m - 2} + 2$ | $T_A + T_X(3 \log_2 m - 4)$ |
| m = 256 | 11,664 | 26,385 | 38,049 | $T_A + 20T_X$ |
| m = 1024 | 104,976 | 247,689 | 352,665 | $T_A + 26T_X$ |
| m = 2048 | 314,928 | 751,255 | 1,066,183 | $T_A + 29T_X$ |
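
The closed forms for the proposed scheme can be cross-checked against the recursion itself. The following is a small sketch, with hypothetical function names, that assumes $m$ is a power of two, stops at the direct 4-coefficient multiplier (16 AND gates, 9 XOR gates, XOR depth 2), and applies the per-level overhead of $4m - 4$ XOR gates and $3T_X$ delay.

```python
def gate_counts(m):
    """(AND gates, XOR gates, XOR delay in units of T_X) for m = 2^t, m >= 4."""
    if m == 4:
        # Direct multiplier at the stop condition: m^2, (m-1)^2, ceil(log2(m-1)).
        return 16, 9, 2
    ands, xors, delay = gate_counts(m // 2)
    # Three half-size multipliers in parallel plus the recombination overhead.
    return 3 * ands, 3 * xors + 4 * m - 4, delay + 3


def closed_form(m):
    """Same quantities from the closed-form expressions, with k = log2(m) - 2."""
    k = m.bit_length() - 1 - 2
    ands = 16 * 3**k
    # 8m[(3/2)^k - 1] written with integer arithmetic only.
    xors = 8 * m * (3**k - 2**k) // 2**k + 7 * 3**k + 2
    return ands, xors, 3 * (k + 2) - 4


for m in (256, 1024, 2048):
    print(m, gate_counts(m), closed_form(m))
```

The AND delay $T_A$ is incurred once, at the leaves, and is therefore left out of the third component.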

Now we have to generalize the previous results, in order to make the scheme
suitable for AOP generators. In fact, it is possible to employ the same strategy
also when $m$ is not a power of 2. To make the design very modular, we do not
optimize the structure of the multiplier by distinguishing the two cases, $m$ even
and odd, as done in [12]. Instead, we simply expand the circuit registers to
handle, at each step, polynomials of odd degree, that is with an even number
of coefficients. As a consequence the following generalization can be derived and
used to multiply polynomials of any degree ($m > 4$):

$$(\Sigma'_A)_m = 16 \cdot 3^{\lceil \log_2 m \rceil - 2}$$

$$(\Sigma'_X)_m = 4 \cdot \sum_{i=0}^{\lceil \log_2 m \rceil - 3} 3^i \left\lceil \frac{m}{2^i} \right\rceil + 7 \cdot 3^{\lceil \log_2 m \rceil - 2} + 2 \qquad (11)$$

$$(\Theta')_m = T_A + T_X(3 \lceil \log_2 m \rceil - 4)$$

At the end of this first phase, the circuit outputs the coefficients of the
product polynomial $\ell(x)$, that is the bit vector $(\ell_0, \ell_1, \ldots, \ell_{2m-2})$. The
subsequent step will be the computation of the field element $c(x)$ as the remainder
$c(x) = \ell(x) \bmod p(x)$.

3.2 Reduction Phase 

Let $\ell(x) = (\ell_0, \ell_1, \ldots, \ell_{2m-2})$ be the polynomial given by the ordinary product of
$a(x)$ and $b(x)$. The current phase prescribes the computation of the field element
$c(x) = a(x) \cdot b(x) \in GF(2^m)$ as the remainder of the polynomial division of $\ell(x)$
by the generator polynomial $p(x)$. To speed up this computation it is possible
to take advantage of the structure of the generator $p(x)$. Thanks to the regular
form of this polynomial it is easy to express the coefficients $c_i$ of the field element
in terms of the coefficients $\ell_i$. Specifically, it can be shown that the field element
$c(x) = (c_0, c_1, \ldots, c_{m-1})$ can be computed as

$$c_i = \begin{cases} \ell_i + \ell_m + \ell_{m+i+1} & \text{if } i = 0, 1, \ldots, m-3 \\ \ell_{m-2} + \ell_m & \text{if } i = m-2 \\ \ell_{m-1} + \ell_m & \text{if } i = m-1 \end{cases}$$

Of course this step can be accomplished with a time complexity equal
to $(\Theta'')_m = 2T_X$, while the related space complexity is given by
$(\Sigma''_X)_m = 2m - 2$.
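
The reduction rule follows from the two identities $x^{m+1} \equiv 1$ and $x^m \equiv 1 + x + \ldots + x^{m-1} \pmod{p(x)}$. As a minimal sketch (the function name and the list-of-bits representation are assumptions made for illustration), it can be written as:

```python
def reduce_mod_aop(l, m):
    """Reduce l = [l_0, ..., l_{2m-2}] modulo p(x) = 1 + x + ... + x^m.

    Returns [c_0, ..., c_{m-1}]. Each output bit costs at most two XORs,
    for a total of 2m - 2 XOR gates and an XOR depth of 2.
    """
    assert len(l) == 2 * m - 1
    c = [0] * m
    for i in range(m - 2):
        # x^(m+1+i) folds onto x^i, and x^m contributes to every coefficient.
        c[i] = l[i] ^ l[m] ^ l[m + i + 1]
    c[m - 2] = l[m - 2] ^ l[m]
    c[m - 1] = l[m - 1] ^ l[m]
    return c
```

The input is the coefficient vector produced by the polynomial multiplication phase; no AND gates are needed in this step.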

As a consequence, the characteristics of the overall multiplier, taking in input
the two bit vectors $(a_0, a_1, \ldots, a_{m-1})$ and $(b_0, b_1, \ldots, b_{m-1})$ and producing, at
the output of the circuit, the product element $(c_0, c_1, \ldots, c_{m-1})$, will be given
by:

$$(\Sigma_A)_m = (\Sigma'_A)_m + (\Sigma''_A)_m = 16 \cdot 3^{\lceil \log_2 m \rceil - 2}$$

$$(\Sigma_X)_m = (\Sigma'_X)_m + (\Sigma''_X)_m = 4 \cdot \sum_{i=0}^{\lceil \log_2 m \rceil - 3} 3^i \left\lceil \frac{m}{2^i} \right\rceil + 7 \cdot 3^{\lceil \log_2 m \rceil - 2} + 2m \qquad (12)$$

$$(\Theta)_m = (\Theta')_m + (\Theta'')_m = T_A + T_X(3 \lceil \log_2 m \rceil - 2)$$

It should be noted that the final space complexities are notably lower than
those currently available in the literature and belonging to the same class ([3,6,7]).
For a direct comparison see also Table 3, where it is evident how our scheme
trades time complexity for a more substantial reduction
in both the number of AND and XOR gates. Moreover, this gain grows as $m$
grows. For instance, if $m = 226$, our multiplier provides a reduction factor
in the overall gate count equal to 2.7 with respect to the best method ([3]),
paying a corresponding time expansion factor of 2. On the other hand, in the case of
$m = 2026$, the area reduction factor becomes 7.7, while the corresponding time
expansion rises only up to 2.36.
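
The figures quoted for $m = 226$ and $m = 2026$ can be reproduced with a short script. This is a sketch under an explicit assumption: the summation in the XOR count is taken over $i = 0, \ldots, \lceil \log_2 m \rceil - 3$ with terms $3^i \lceil m/2^i \rceil$, a reading chosen because it reproduces the entries of Table 3; the function name is hypothetical.

```python
def proposed_multiplier_cost(m):
    """(AND gates, XOR gates, XOR delay in T_X) of the overall multiplier, m > 4."""
    levels = (m - 1).bit_length()        # ceil(log2(m)) for m >= 2
    k = levels - 2                       # number of recursive splits
    ands = 16 * 3**k
    # -(-m // 2**i) is the integer ceiling of m / 2^i.
    xors = 4 * sum(3**i * -(-m // 2**i) for i in range(k)) + 7 * 3**k + 2 * m
    delay = 3 * levels - 2
    return ands, xors, delay


for m in (226, 1018, 2026):
    ands, xors, delay = proposed_multiplier_cost(m)
    reference_total = m * m + (m * m - 1)      # AND + XOR gates of the scheme in [3]
    print(m, ands, xors, delay, round(reference_total / (ands + xors), 1))
```

The last printed column is the ratio between the total gate count of the scheme in [3] and that of the proposed multiplier.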

As an example, Figure 1 reports the scheme of the overall multiplier
when the generating polynomial is $p(x) = 1 + x + x^2 + x^3 + x^4 + x^5 + x^6 + x^7 +
x^8 + x^9 + x^{10}$. In this case the two inputs $a(x)$ and $b(x)$ have been rewritten as
$a(x) = A(x) + x^5 B(x)$ and $b(x) = C(x) + x^5 D(x)$ respectively, where

$$A(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4$$

$$B(x) = a_5 + a_6 x + a_7 x^2 + a_8 x^3 + a_9 x^4$$

and analogously

$$C(x) = b_0 + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4$$

$$D(x) = b_5 + b_6 x + b_7 x^2 + b_8 x^3 + b_9 x^4$$

Therefore, the field element $c(x) = a(x) \cdot b(x) \in GF(2^{10})$ can be computed
by means of three multiplication circuits for polynomials of degree 4, to obtain
$(A + B) \cdot (C + D)$, $A \cdot C$ and $B \cdot D$, plus some XOR gates, needed to recombine
partial results (block Recombination) and to perform the reduction phase (block
Reduction Phase). To make the circuit design fully modular (which could be an
advantage, especially if $m \gg 10$), we do not directly deal with these polynomials
of degree 4. Instead we extend these polynomials by a single bit, in order to
obtain polynomials of degree 5. This provides us with the possibility to further
iterate the algorithm and to directly employ modules architecturally equivalent
to the previous ones. In fact, each of these three products can be computed, in
turn, by means of three further multiplication circuits for polynomials of degree 2,
for the parallel computation of $(A' + B') \cdot (C' + D')$, $A' \cdot C'$ and $B' \cdot D'$, plus the
XOR gates needed for the recombination. The latter 9 polynomial
multiplications, by contrast, are not further iterated, because of the lower time and space
complexities provided by a direct multiplication.

Table 3. Comparing different Canonical Basis multipliers with generating AOPs.

| Scheme | $\Sigma_A$ | $\Sigma_X$ | $\Theta$ |
|---|---|---|---|
| Itoh-Tsujii [7] | $m^2 + 2m + 1$ | $m^2 + 2m$ | $T_A + T_X \lceil \log_2 m + \log_2(m + 2) \rceil$ |
| m = 226 | 51,529 | 51,528 | $T_A + 16T_X$ |
| m = 1018 | 1,038,361 | 1,038,360 | $T_A + 20T_X$ |
| m = 2026 | 4,108,729 | 4,108,728 | $T_A + 22T_X$ |
| Hasan et al. [5] | $m^2$ | $m^2 + m - 2$ | $T_A + T_X(\lceil \log_2(m - 1) \rceil + m)$ |
| m = 226 | 51,076 | 51,300 | $T_A + 234T_X$ |
| m = 1018 | 1,036,324 | 1,037,340 | $T_A + 1028T_X$ |
| m = 2026 | 4,104,676 | 4,106,700 | $T_A + 2037T_X$ |
| Koç-Sunar [3] | $m^2$ | $m^2 - 1$ | $T_A + T_X(\lceil \log_2(m - 1) \rceil + 2)$ |
| m = 226 | 51,076 | 51,075 | $T_A + 10T_X$ |
| m = 1018 | 1,036,324 | 1,036,323 | $T_A + 12T_X$ |
| m = 2026 | 4,104,676 | 4,104,675 | $T_A + 13T_X$ |
| Proposed | $16 \cdot 3^{\lceil \log_2 m \rceil - 2}$ | $4 \cdot \sum_{i=0}^{\lceil \log_2 m \rceil - 3} 3^i \lceil m/2^i \rceil + 7 \cdot 3^{\lceil \log_2 m \rceil - 2} + 2m$ | $T_A + T_X(3 \lceil \log_2 m \rceil - 2)$ |
| m = 226 | 11,664 | 25,635 | $T_A + 22T_X$ |
| m = 1018 | 104,976 | 249,627 | $T_A + 28T_X$ |
| m = 2026 | 314,928 | 754,365 | $T_A + 31T_X$ |



4 Type I Optimal Normal Basis 

The previous scheme can also be adopted in the case of Type I ONB, following the
smart strategy proposed in [3]. Specifically, let $p(x) = 1 + x + x^2 + \ldots + x^m$ be an
N-polynomial over GF(2), and let $a(\gamma)$ and $b(\gamma)$ be two elements of $GF(2^m)$,
represented through the $m$-bit vectors $(a_0, a_1, \ldots, a_{m-1})$ and $(b_0, b_1, \ldots, b_{m-1})$
with respect to the root $\gamma$ of $p(x)$. Given that $p(x)$ is also an AOP, the root $\gamma$
satisfies the property $\gamma^{m+1} = 1$; in fact

$$p(x) = 1 + x + x^2 + \ldots + x^m = \frac{x^{m+1} + 1}{x + 1} \qquad (13)$$

Fig. 1. Multiplier for Canonical Basis over $GF(2^{10})$ with generating polynomial AOP.

As a consequence, the set

$$\{\gamma, \gamma^2, \gamma^3, \ldots, \gamma^m\} \qquad (14)$$

can also be used as a basis for $GF(2^m)$. More precisely, (14) is nothing but a
shifted version of the Canonical Basis; therefore the elements of $GF(2^m)$
represented in Type I ONB can be quickly converted to Canonical Basis, and vice
versa, by means of a simple permutation of the components. In fact, thanks to
the relation $\gamma^{m+1} = 1$, we can write the conversion

$$a(\gamma) = \sum_{i=0}^{m-1} a_i \gamma^{2^i} = \sum_{j=1}^{m} a'_j \gamma^j \qquad (15)$$

by means of the permutation $P$ defined as

$$a'_{2^i \bmod (m+1)} = a_i \quad \text{for } i = 0, 1, \ldots, m - 1 \qquad (16)$$

Therefore, the elements to be multiplied in Type I ONB will simply be
converted to Canonical Basis, through the permutation $P$, before entering the
multiplier. The output of the circuit, computed according to the complexities given
in (12) and still represented in Canonical Basis, will be restored to Normal Basis
thanks to the inverse permutation $P^{-1}$. It should be noted that these two
additional permutations do not increase the overall time and space complexity of
the multiplier. In fact, $P$ and its inverse $P^{-1}$ can be directly implemented by
wiring the fan-in and fan-out of the circuit, without modifying any complexity.
Therefore, our scheme is able to maintain the previously discussed gate count
reduction also in the case of Type I ONB. This reduction is significant, especially
if compared with the one provided by the other fast parallel schemes currently
available in the literature ([4,6,3]), as reported in Table 4. Finally, also in this case
the gain factor becomes larger as $m$ grows, as previously seen
for Canonical Basis.

Table 4. Comparing different Type I ONB multipliers.

| Scheme | $\Sigma_A$ | $\Sigma_X$ | $\Theta$ |
|---|---|---|---|
| Massey-Omura [4] | $m^2$ | $2m^2 - 2m$ | $T_A + T_X(\lceil \log_2(m - 1) \rceil + 1)$ |
| m = 226 | 51,076 | 101,700 | $T_A + 9T_X$ |
| m = 1018 | 1,036,324 | 2,070,612 | $T_A + 11T_X$ |
| m = 2026 | 4,104,676 | 8,205,300 | $T_A + 12T_X$ |
| Hasan et al. [6] | $m^2$ | $m^2 - 1$ | $T_A + T_X(\lceil \log_2(m - 1) \rceil + 1)$ |
| m = 226 | 51,076 | 51,075 | $T_A + 9T_X$ |
| m = 1018 | 1,036,324 | 1,036,323 | $T_A + 11T_X$ |
| m = 2026 | 4,104,676 | 4,104,675 | $T_A + 12T_X$ |
| Koç-Sunar [3] | $m^2$ | $m^2 - 1$ | $T_A + T_X(\lceil \log_2(m - 1) \rceil + 2)$ |
| m = 226 | 51,076 | 51,075 | $T_A + 10T_X$ |
| m = 1018 | 1,036,324 | 1,036,323 | $T_A + 12T_X$ |
| m = 2026 | 4,104,676 | 4,104,675 | $T_A + 13T_X$ |
| Proposed | $16 \cdot 3^{\lceil \log_2 m \rceil - 2}$ | $4 \cdot \sum_{i=0}^{\lceil \log_2 m \rceil - 3} 3^i \lceil m/2^i \rceil + 7 \cdot 3^{\lceil \log_2 m \rceil - 2} + 2m$ | $T_A + T_X(3 \lceil \log_2 m \rceil - 2)$ |
| m = 226 | 11,664 | 25,635 | $T_A + 22T_X$ |
| m = 1018 | 104,976 | 249,627 | $T_A + 28T_X$ |
| m = 2026 | 314,928 | 754,365 | $T_A + 31T_X$ |
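
The permutation $P$ of eq. (16) amounts to re-indexing the coefficient vector. A minimal sketch follows, assuming that the normal-basis coefficient $a_i$ multiplies $\gamma^{2^i}$ and that the permuted vector is indexed by the exponents $1, \ldots, m$ (index 0 is left unused); the function names are illustrative only.

```python
def onb_to_shifted_basis(a, m):
    """Apply P: a[i] (coefficient of gamma^(2^i)) moves to position 2^i mod (m+1)."""
    out = [0] * (m + 1)                 # out[j] is the coefficient of gamma^j, j = 1..m
    for i in range(m):
        out[pow(2, i, m + 1)] = a[i]
    return out


def shifted_basis_to_onb(ap, m):
    """Apply the inverse permutation P^-1."""
    return [ap[pow(2, i, m + 1)] for i in range(m)]


# m = 4: the exponents 2^i mod 5 are 1, 2, 4, 3.
print(onb_to_shifted_basis(["a0", "a1", "a2", "a3"], 4))
```

Because 2 is primitive modulo $m + 1$, the exponents $2^i \bmod (m+1)$ visit each of $1, \ldots, m$ exactly once, so the mapping is a pure rewiring with no gates.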

5 Conclusions 

In this paper we have proposed a new low space complexity scheme for fast
parallel multiplication of field elements represented through both Canonical and
Type I Optimal Normal Bases. Specifically, the discussed strategy shows how
to avoid quadratic space complexity, paying only a limited increase in the
corresponding time delay. As reported in Tables 3 and 4, the proposed scheme
offers a circuit complexity significantly lower than that of the other fast parallel
schemes present in the open literature ([3,4,5,6,7]). This characteristic makes
this multiplier particularly suitable for applications with
specific space constraints, such as those based on smart cards, token
hardware, mobile phones or other portable devices.

Abstract. We explore the use of subfield arithmetic for efficient
implementations of Galois Field arithmetic especially in the context of the
Rijndael block cipher. Our technique involves mapping field elements to 
a composite field representation. We describe how to select a represen- 
tation which minimizes the computation cost of the relevant arithmetic, 
taking into account the cost of the mapping as well. Our method results 
in a very compact and fast gate circuit for Rijndael encryption. 

In conjunction with bit-slicing techniques applied to newly proposed par- 
allelizable modes of operation, our circuit leads to a high-performance 
software implementation for Rijndael encryption which offers significant 
speedup compared to previously reported implementations. 



1 Introduction 

In October 2000, the US National Institute of Standards and Technology (NIST)
announced that it had selected the Rijndael Block Cipher [3] as the new
Advanced Encryption Standard (AES). In addition to being the new standard,
Rijndael is a cipher that offers a good “combination of security, performance,
efficiency, implementability and flexibility” [20]. It has already attained
considerable popularity and acceptance. Rijndael is a block cipher with a block size of
16 bytes, each of which represents an element in the Galois Field $GF(2^8)$. All
operations in Rijndael are defined in terms of arithmetic in this field.

Apart from Rijndael, there are several other instances of the use of Galois
Field arithmetic in cryptography and coding theory [10]. The efficiency and
performance of such applications depend upon the representation of field
elements and the implementation of field arithmetic. It is common practice to
obtain efficiency by careful selection of the field representation [9,10,11]. In
particular, it is well known that the computational cost of certain Galois Field
operations is lower when field elements are mapped to an isomorphic composite
field, in which these operations are implemented using lower-cost subfield
arithmetic operations as primitives [11]. Depending upon the computation involved
and the choice of representation, there are costs associated with the mapping and
conversion, and a trade-off has to be made between such costs and the savings
obtained. The design task is to carefully evaluate these trade-offs to minimize
the computational cost.

* As of April 2001, the author can be reached at Amazon.com, 605 5th Ave South,
Seattle, WA 98104, U.S.A.

In addition to an efficient hardware implementation, a good circuit design is 
also useful in obtaining fast software implementations. Using the technique of 
bit-slicing [2] a circuit with a small number of gates can be simulated using a 
wide-word processor. Multiple instances of the underlying computation are thus 
performed in parallel to exploit the parallelism implicit in a wide-word computer. 
This technique has been used in [2] to obtain a fast DES implementation. 

In this paper, we study the use of composite field techniques for Galois Field 
arithmetic in the context of the Rijndael cipher. We show that substantial gains 
in performance can be obtained through such an approach. We obtain a compact 
gate circuit for Rijndael and use its design to illustrate the trade-offs associated 
with design choices such as field polynomials and representations. We use our 
circuit design to obtain a simple and fast software implementation of Rijndael 
for wide-word architectures. The performances of both hardware as well as soft- 
ware implementations show large gains in comparison with previously reported 
performance figures. 

The rest of this paper is organized as follows. In Section 2, we detail the map- 
ping of Galois Field operations to composite field arithmetic. Section 3 outlines 
the data-slicing technique for realizing a highly parallel software implementation 
from a circuit design. In Section 4, we describe the mapping of Rijndael opera- 
tions to a particular class of composite fields. The selection of field polynomials 
and representations and the associated optimizations are discussed in Sections 
5 and 6 respectively. Finally, in Section 7 we present our results and a compari- 
son with previously reported performance figures for Rijndael. Drawings of our 
Rijndael encryption circuit are included in the Appendix. 



2 GF Arithmetic and Composite Fields 

Composite fields are frequently used in implementations of Galois Field
arithmetic [9,10,11]. In cases where arithmetic operations rely on table lookups,
subfield arithmetic is used to reduce lookup-related costs. This technique has been
field arithmetic is used to reduce lookup-related costs. This technique has been 
used to obtain relatively efficient implementations for specific operations such as 
multiplication, inversion and exponentiation. Much of this work has been aimed 
at implementation of channel codes. The object has usually been to obtain better 
software implementations by using smaller tables through subfield arithmetic. 
Applications to hardware design (such as [10]) have been relatively infrequent. 

Our techniques are directed at both hardware and software implementations.
We take advantage of the efficiency obtained by the use of subfield arithmetic,
not merely in the matter of smaller tables but in the overall low-level (gate count)
complexity of various arithmetic operations. The computation and comparison
of such gains and costs depends upon several parameters: the overhead of
mapping between the original and the composite field representations, the
nature of the underlying computation and its composition in terms of the relative
frequency of various arithmetic operations, and, in the case of software
implementations, the constraints imposed by the target architecture and its instruction
set. Based on these parameters we select the appropriate field and
representation to optimize a hardware circuit design. As we shall see, there can be several
objectives for this optimization, such as critical path lengths and gate counts,
depending upon the overall design goals. The circuit design obtained can then
be used to obtain parallelism in a software implementation by means of slicing
techniques.

As described in [11], the two pairs $\{GF(2^n), Q(y)\}$ and $\{GF((2^n)^m), P(x)\}$
constitute a composite field if $GF(2^n)$ is constructed from $GF(2)$ by $Q(y)$ and
$GF((2^n)^m)$ is constructed from $GF(2^n)$ by $P(x)$, where $Q(y)$ and $P(x)$ are
polynomials of degree $n$ and $m$ respectively. The fields $GF((2^n)^m)$ and $GF(2^k)$,
$k = nm$, are isomorphic to each other. Since the complexity of various
arithmetic operations differs from one field to another, we can take advantage of the
isomorphism to map a computation from one to the other in search of efficiency.
For a given underlying field $GF(2^k)$, our gains depend on the choice of $n$ and $m$
as well as of the polynomials $Q(y)$ and $P(x)$.

While we restrict our description to composite fields of the type $GF((2^n)^m)$,
it is easy to see that the underlying techniques are fully general and can be used
for any composite field.



3 Slicing Techniques 

Bit-slicing is a popular technique [2] that makes use of the inbuilt
parallel-processing capability of a wide-word processor. Bit-slicing regards a $W$-bit
processor as a SIMD parallel computer capable of performing $W$ parallel 1-bit
operations simultaneously. In this mode, an operand word contains $W$ bits from
$W$ different instances of the computation. Initially, $W$ different inputs are taken
and arranged so that the first word of the re-arranged input contains the first
bit from each of the $W$ inputs, the second word contains the second bit from
each input, and so on. The resulting bit-sliced computation can be regarded as
simulating $W$ instances of the hardware circuit for the original computation.
Indeed, a bit-sliced computation is designed by first designing a hardware circuit
and then simulating it using $W$-bit registers on the rearranged input described
above.
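
The rearrangement and the gate-by-gate simulation can be sketched as follows. This is an illustrative toy, not the Rijndael circuit: the function names are hypothetical, Python integers stand in for $W$-bit registers, and the simulated circuit is a 5-gate product of two 2-bit polynomials over GF(2).

```python
def bitslice(inputs, width):
    """Transpose W inputs of `width` bits into `width` words of W bits each."""
    words = [0] * width
    for lane, value in enumerate(inputs):
        for bit in range(width):
            words[bit] |= ((value >> bit) & 1) << lane
    return words


def unbitslice(words, lanes):
    """Inverse transposition: recover one value per lane."""
    return [sum(((w >> lane) & 1) << bit for bit, w in enumerate(words))
            for lane in range(lanes)]


def clmul2_sliced(words):
    """5-gate circuit (4 AND, 1 XOR); each gate is one word-wide instruction."""
    a0, a1, b0, b1 = words
    return [a0 & b0, (a0 & b1) ^ (a1 & b0), a1 & b1]


# 16 instances at once: bits 0-1 of each input hold a(x), bits 2-3 hold b(x).
inputs = list(range(16))
products = unbitslice(clmul2_sliced(bitslice(inputs, 4)), len(inputs))
print(products)
```

All 16 products are obtained with the same five logical operations that a single instance of the circuit would need, which is the effect the bit-sliced implementation relies on.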

A bit-sliced implementation corresponding to an $N$-gate circuit requires $N$
instructions to carry out $W$ instances of the underlying computation, or $N/W$
instructions per instance. This can be particularly efficient for computations
which are not efficiently supported by the target architecture. Consider for
instance $GF(2^8)$ multiplication on AltiVec [4]. The straightforward
implementation uses three table lookups and one addition modulo 255. Sixteen parallel lookups
in a 256-entry table can be performed on AltiVec in 20 instructions. Thus, a set
of 128 multiplications would require 488 instructions. In comparison, our
137-gate multiplication circuit translates into a bit-sliced implementation that can
perform 128 multiplications in 137 instructions!

The above computation ignores the cost of ordering the input in bit-sliced 
fashion and doing the reverse for the output. To evaluate the trade-off correctly, 
this cost has to be taken into account as well. In general, this cost will depend 
on the target instruction set. 

However, it is possible to think of scenarios in which a particular operation
may be efficiently supported in an architecture. For example, if the AltiVec
architecture were to provide an instruction for 16 parallel $GF(2^8)$ multiplications
which use the underlying field polynomial of interest to us (a hypothetical but
nonetheless technically feasible scenario since the critical path of the
multiplication circuit is only six gates deep), then a direct computation would require
only eight instructions, compared to the 137 required by the bit-sliced version.

Now consider $GF(2^{16})$ multiplications on this hypothetical version of the
AltiVec architecture. It is easy to see that the most efficient computation is
neither a direct one, nor a bit-sliced version, but a byte-sliced computation, in
which each $GF(2^{16})$ multiplication is mapped to a small number of $GF(2^8)$
operations, which are efficiently supported by the architecture in question. In
general, the right “slice” to use would depend on the target architecture.



3.1 Encrypting without Chaining 

Our Rijndael implementation processes 128 blocks of data in parallel.
Traditionally, such a scheme would be regarded as more useful for decryption than
for encryption, since encryption is usually performed in inherently sequential
modes such as Cipher Block Chaining (CBC) [17,18,19]. The well-known CBC
mode is used as a defense against replay attacks [12]. In the CBC mode of
encryption, parallel blocks would not be available for encryption except where
data from many streams is encrypted in parallel.

However, a new parallelizable variant of CBC [7] removes this limitation and 
makes it possible to use CBC encryption without the usual sequentiality. This 
makes it possible to utilize the high throughput rates of our implementation in 
conjunction with the popular CBC mode. 



4 Rijndael in a Composite Field 

Rijndael involves arithmetic on $GF(2^8)$ elements. In a straightforward
implementation, inverse, multiplication and substitution are likely to be the operations
that determine the overall complexity of the implementation. The most common
approach is to use table lookups for these operations. By mapping the operations
into a composite field, we are able to obtain both a small circuit in the case of a
hardware implementation and smaller instruction counts and table sizes
in the case of software implementations.

For our Rijndael implementation, we work in the composite field $GF((2^4)^2)$.
We selected the field polynomial $Q(y) = y^4 + y + 1$ for $GF(2^4)$. For $P(x)$, we
consider all primitive polynomials of the form $P(x) = x^2 + x + \lambda$ where $\lambda$ is an
element of $GF(2^4)$. There are four such polynomials, for each of which there are
seven different transformation matrices to consider, one corresponding to each
possible choice of basis. The criterion we use to compare the various choices is
the gate count of the resulting Rijndael circuit implementation.

Rijndael operations translate to the composite field representation as follows.
$H$ denotes the mapping from $GF(2^8)$ to $GF((2^4)^2)$, and $T$ the corresponding
transformation matrix, that is, $H(x) = Tx$. $S$ is a $4 \times 4$ matrix (the state) on
which all operations are performed.

- ByteSub transformation: This has essentially two sub-steps:

1. $P_{ij} = (S_{ij})^{-1}$. In the composite field, $H(P_{ij}) = (H(S_{ij}))^{-1}$.

The calculation of an inverse is as follows. Every $A \in GF((2^4)^2)$ can be
represented as $A = a_0 + \beta a_1$ where $\beta^2 + \beta + \lambda = 0$, and $a_0, a_1 \in GF(2^4)$.
The inverse is $B = A^{-1} = b_0 + \beta b_1$, $b_0, b_1 \in GF(2^4)$, such that $b_0 =
(a_0 + a_1)\Delta^{-1}$ and $b_1 = a_1 \Delta^{-1}$, where $\Delta = a_0(a_0 + a_1) + \lambda a_1^2$.

2. $Q_{ij} = A P_{ij} + c$ where $A$ is a fixed $8 \times 8$ matrix and $c \in GF(2^8)$.
In the composite field, $H(Q_{ij}) = H(A P_{ij}) + H(c) = T A P_{ij} + H(c)
= T A T^{-1} H(P_{ij}) + H(c)$.

- ShiftRow transformation: This step is independent of representation.

- MixColumn transformation: This involves essentially the computation $P_{ij} =
a_1 S_{1j} + a_2 S_{2j} + a_3 S_{3j} + a_4 S_{4j}$, where $(a_1, a_2, a_3, a_4)$ is a permutation of
(01, 01, 02, 03). In the composite field,

$$H(P_{ij}) = H(a_1)H(S_{1j}) + H(a_2)H(S_{2j}) + H(a_3)H(S_{3j}) + H(a_4)H(S_{4j}).$$

The following observations are useful in the implementation:

• If $X \in GF((2^4)^2)$ then $H(01) \times X = X$, as the identity element is mapped
to the identity element in a homomorphism.

• $H(03) = H(02) + H(01)$.

- Round Key addition: The operation is $P = S + K$ where $K$ is the round key.
In the composite field, $H(P) = H(S) + H(K)$. Addition is simply an EXOR
in either representation.

The mapping of the arithmetic to the composite field together with judicious 
choice of the field polynomial gives us a substantially smaller circuit, as we shall 
see in Section 7. 
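
The inversion formula for $GF((2^4)^2)$ given in the ByteSub step can be exercised directly. The sketch below is only an illustration under stated assumptions: all function names are hypothetical, $GF(2^4)$ uses $Q(y) = y^4 + y + 1$, and the constant `LAMBDA` is simply the first value that makes $x^2 + x + \lambda$ irreducible over $GF(2^4)$, which is weaker than the primitivity requirement used in the paper and is not claimed to be the authors' choice.

```python
def gf16_mul(a, b):
    """Multiply two GF(2^4) elements modulo y^4 + y + 1 (0x13)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def gf16_inv(a):
    """Inverse in GF(2^4) as a^14 (returns 0 for a == 0)."""
    r = 1
    for _ in range(14):
        r = gf16_mul(r, a)
    return r


# Assumption: any lambda for which x^2 + x + lambda has no root in GF(2^4).
LAMBDA = next(lam for lam in range(1, 16)
              if all(gf16_mul(x, x) ^ x ^ lam for x in range(16)))


def comp_mul(A, B):
    """(a0 + beta*a1)(b0 + beta*b1) with beta^2 = beta + LAMBDA."""
    (a0, a1), (b0, b1) = A, B
    hi = gf16_mul(a1, b1)
    return (gf16_mul(a0, b0) ^ gf16_mul(LAMBDA, hi),
            gf16_mul(a0, b1) ^ gf16_mul(a1, b0) ^ hi)


def comp_inv(A):
    """b0 = (a0 + a1)/Delta, b1 = a1/Delta, Delta = a0(a0 + a1) + LAMBDA*a1^2."""
    a0, a1 = A
    delta = gf16_mul(a0, a0 ^ a1) ^ gf16_mul(LAMBDA, gf16_mul(a1, a1))
    d_inv = gf16_inv(delta)
    return gf16_mul(a0 ^ a1, d_inv), gf16_mul(a1, d_inv)
```

Only one inversion in the 4-bit subfield is needed per 8-bit inversion; the rest is a handful of subfield multiplications and additions, which is where the gate savings come from.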

5 Optimizations 

All operations in the Rijndael block cipher are in $GF(2^8)$. As outlined in sec-
tion 4, some of these $GF(2^8)$ operations have relatively inefficient gate circuit
implementations and can be implemented more efficiently in some isomorphic
composite field. One overhead in using subfield arithmetic is the cost of the
conversion from the original to the composite field and vice-versa. To illustrate,
consider our Rijndael implementation, which uses subfield arithmetic. The cost
of the transformations is dependent on the choice of the composite field. We
describe below a method by which an efficient^ transformation matrix from the
set $\mathcal{T} = \{T_0, T_1, \ldots\}$ of valid transformation matrices can be chosen.

Let $C(\theta)$ denote the cost of the operation $\theta$, which in the present case is taken
to be the gate count of the circuit implementation of $\theta$. Depending upon design
objectives and application, there can be alternative cost measures, such as the
depth of the critical path, for instance. Let $W(x)$ denote the Hamming weight
of $x$, i.e., the number of 1s in the polynomial representation of $x$.

The aim is to find T*, the most efficient transformation, and the correspond- 
ing choice of composite field. This is the composite field which minimizes the 
gate count of the Rijndael circuit implementation. Note that while comparing 
the cost for different transformations, we need to consider only those Rijndael 
operations whose costs are dependent upon the choice of composite field. The 
relevant operations are those which involve $\lambda$ or the conversion matrices ($T$ and
$T^{-1}$).

The costs of different operations are: 

— Transform : This step involves computing $H(S)$.

Thus, $C(\mathit{transform}) = 16 \times C(T.x)$.

— ByteSub transformation : As noted earlier, this step consists of an inverse
calculation and an affine transform:

1. $P_{ij} = (S_{ij})^{-1}$. The only operation whose cost depends on the choice of
field is the calculation of $\lambda a_1^2$. So $C(\mathit{inverse}) = 16 \times C(\lambda.x)$.

2. $Q_{ij} = A P_{ij} + c$, or, in the composite field,
$H(Q_{ij}) = H(A P_{ij}) + H(c) = T A P_{ij} + H(c) = T A T^{-1} H(P_{ij}) + H(c)$.

Thus $C(\mathit{affine}) = 16 \times (C(B.x) + W(H(c)))$, where $B = T A T^{-1}$.

— ShiftRow transformation : This step does not require any computation.

— MixColumn transformation : As noted earlier, this step is the computation

$H(P_{ij}) = H(a_1)H(S_{1j}) + H(a_2)H(S_{2j}) + H(a_3)H(S_{3j}) + H(a_4)H(S_{4j})$.

Since $H(01).x = x$, $C(\mathit{mixClm}) = 16 \times (C(H(02).x) + C(H(03).x))$.

— Round Key addition : The computation is $H(P) = H(S) + H(K)$, so
$C(\mathit{addKey}) = 16 \times C(T.x)$.

— Inverse Transform : $C(\mathit{invTransform}) = 16 \times C(T^{-1}.x)$.

T* depends upon whether a pipelined (unrolled loop) or iterative (loop not 
unrolled) Rijndael circuit is to be obtained. The former offers superior perfor- 
mance compared to the latter at the cost of a larger gate count. 

^ In terms of the Rijndael gate-circuit implementation.

^ Note that $W(H(c))$ is the number of NOT gates required to implement $H(c) + x$,
where $x \in GF((2^4)^2)$.
The criterion for the best transformation can be represented as follows:

$$T^* = \arg\min_{T} \big( C(\mathit{transform}) + n \times C(\mathit{inverse}) + n \times C(\mathit{affine}) + m \times C(\mathit{mixClm}) + (n + 1) \times C(\mathit{addKey}) + C(\mathit{invTransform}) \big)$$

i.e.

$$T^* = \arg\min_{T} \big( (n + 2) \times C(T.x) + n \times C(\lambda.x) + n \times (C(B.x) + W(H(c))) + m \times (C(H(02).x) + C(H(03).x)) + C(T^{-1}.x) \big)$$

where $m$ and $n$ are both 1 for an iterative circuit, and $\mathcal{R}$ and $\mathcal{R} - 1$ respectively
for a pipelined circuit, where $\mathcal{R}$ is the number of rounds as specified in the
Rijndael cipher.

Based on these considerations, we selected the polynomial $P(x) = x^2 + x + \omega^{14}$.
That is, we chose $\lambda$ to be $\omega^{14}$, where $\omega$ is the primitive element of
$GF(2^4)$. The following transformation matrix maps an element from $GF(2^8)$ to
the corresponding element in the chosen composite field:

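$$\begin{pmatrix}
1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 1 & 0 & 0 \\
1 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 1 & 1 & 1 & 0 & 1
\end{pmatrix}$$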
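
As a rough illustration of how the selection criterion of this section could be evaluated, the sketch below scores candidate transformations with the weighted sum given above. It makes two assumptions that are not from the paper: the cost of a constant matrix-vector product is approximated by a naive EXOR count (a row with w ones needs w - 1 two-input EXOR gates, with no sharing of subexpressions), and the per-candidate cost figures in the usage example are invented placeholders. The names `xor_gate_count`, `total_cost`, and `best_transform` are hypothetical.

```python
# Rows of the transformation matrix printed above, as bit strings.
T_ROWS = ["10100000", "10101100", "11010010", "01110000",
          "11000110", "01010010", "00001010", "11011101"]


def xor_gate_count(rows):
    """Naive gate count for y = M.x over GF(2): w ones in a row cost w - 1 EXORs."""
    return sum(max(row.count("1") - 1, 0) for row in rows)


def total_cost(c, n=1, m=1):
    """Weighted cost from the criterion; c maps operation names to gate counts."""
    return ((n + 2) * c["T"] + n * c["lam"] + n * (c["B"] + c["W_Hc"])
            + m * (c["H02"] + c["H03"]) + c["T_inv"])


def best_transform(candidates, n=1, m=1):
    """Return the name of the candidate with the smallest total cost."""
    return min(candidates, key=lambda name: total_cost(candidates[name], n, m))


# Hypothetical cost figures for two candidate composite fields (placeholders only).
CANDIDATES = {
    "T0": {"T": 20, "lam": 3, "B": 18, "W_Hc": 4, "H02": 5, "H03": 7, "T_inv": 22},
    "T1": {"T": 24, "lam": 2, "B": 15, "W_Hc": 3, "H02": 4, "H03": 6, "T_inv": 19},
}
```

With these placeholder numbers, `best_transform(CANDIDATES)` selects `"T0"` for the iterative case (n = m = 1), while `best_transform(CANDIDATES, n=9, m=10)` selects `"T1"`, which illustrates why the best transformation depends on whether the circuit is iterative or pipelined.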

6 Finding a Transform 

A method for generating a transformation matrix to map elements of $GF(2^k)$ to
$GF((2^n)^m)$ can be found in literature [11] for the case where all the field poly-
nomials involved are primitive polynomials. However, in the case of Rijndael the
field polynomial is $R(z) = z^8 + z^4 + z^3 + z + 1$, which is an irreducible polynomial
but is not primitive. Since the fields involved are small, we use an exhaustive
search method that can find the transformation in question in case $R(z)$ is irre-
ducible but not primitive. The basic idea is to map $\alpha$, the primitive element of
$GF((2^n)^m)$, to $\gamma$, a primitive element of $GF(2^k)$, such that field homomorphism
holds.

The algorithm is composed of the following three steps — 

1. Get a primitive element $\gamma$ of $GF(2^k)$ and map $\alpha^i$ to $\gamma^i$ for $i \in [0..(2^k - 1)]$.

Note that this step preserves the multiplicative group homomorphism: for
any $i, j \in [0..(2^k - 1)]$, $\alpha^i \times \alpha^j$ maps to $\gamma^i \times \gamma^j = \gamma^{i+j}$.

2. Perform the following check: $\forall i \in [0..(2^k - 1)]$, if $\alpha^r = \alpha^i + 1$ then $\gamma^r = \gamma^i + 1$.
If so then we have the required mapping; else repeat this step for the next
primitive element.







A. Rudra et al. 



This is to verify additive group homomorphism, which requires that
$\forall i, j \in [0..(2^k - 1)]$, $\alpha^r = \alpha^i + \alpha^j \Rightarrow \gamma^r = \gamma^i + \gamma^j$. That is,

$\alpha^r = \alpha^i \times (1 + \alpha^{j-i}) \Rightarrow \gamma^r = \gamma^i \times (1 + \gamma^{j-i})$.

Multiplicative group homomorphism implies that it is sufficient to verify
whether

$\forall i, j \in [0..(2^k - 1)]$, $\alpha^{r-i} = 1 + \alpha^{j-i} \Rightarrow \gamma^{r-i} = 1 + \gamma^{j-i}$.

3. The matrix is obtained by placing in the $i$th column the element $H(2^i)$
in the standard basis representation^ for all $i$.
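
The three steps above can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the authors' program: $GF(2^8)$ uses the Rijndael polynomial, the composite field is $GF((2^4)^2)$ with an assumed subfield polynomial $x^4 + x + 1$ and an assumed $\lambda$ (the first value making $x^2 + x + \lambda$ irreducible), and a composite element $a_0 + \beta a_1$ is packed as `(a1 << 4) | a0`. The returned dictionary maps composite-field elements to $GF(2^8)$ elements; all names are hypothetical.

```python
def gf16_mul(a, b):
    """GF(2^4) multiply modulo x^4 + x + 1 (assumed subfield polynomial)."""
    r = 0
    for _ in range(4):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return r


def gf256_mul(a, b):
    """GF(2^8) multiply modulo the Rijndael polynomial z^8 + z^4 + z^3 + z + 1."""
    r = 0
    for _ in range(8):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r


LAM = next(l for l in range(1, 16)
           if all(gf16_mul(t, t) ^ t ^ l for t in range(16)))


def comp_mul(x, y):
    """GF((2^4)^2) multiply with beta^2 = beta + LAM."""
    a0, a1, b0, b1 = x & 15, x >> 4, y & 15, y >> 4
    hh = gf16_mul(a1, b1)
    return ((gf16_mul(a0, b1) ^ gf16_mul(a1, b0) ^ hh) << 4) | \
           (gf16_mul(a0, b0) ^ gf16_mul(LAM, hh))


def powers(g, mul):
    """Return [g^0, g^1, ..., g^254]."""
    out, x = [], 1
    for _ in range(255):
        out.append(x)
        x = mul(x, g)
    return out


def find_isomorphism():
    """Exhaustive search: map alpha^i -> gamma^i and test the '+1' condition."""
    a_pow = next(p for p in (powers(g, comp_mul) for g in range(2, 256))
                 if len(set(p)) == 255)            # alpha is primitive
    a_log = {v: i for i, v in enumerate(a_pow)}
    for gamma in range(2, 256):
        g_pow = powers(gamma, gf256_mul)
        if len(set(g_pow)) != 255:                 # gamma must be primitive
            continue
        g_log = {v: i for i, v in enumerate(g_pow)}
        # Step 2: alpha^r = alpha^i + 1 must imply gamma^r = gamma^i + 1
        # (i = 0 is skipped because alpha^0 + 1 = 0 has no discrete log).
        if all(a_log[a_pow[i] ^ 1] == g_log[g_pow[i] ^ 1] for i in range(1, 255)):
            mapping = {0: 0}
            mapping.update(zip(a_pow, g_pow))
            return mapping
    raise ValueError("no isomorphism found")


def matrix_columns(mapping):
    """Step 3: column i is the image of the basis element with only bit i set."""
    return [mapping[1 << i] for i in range(8)]


def apply_columns(cols, x):
    """Matrix-vector product over GF(2): EXOR the columns selected by the bits of x."""
    y = 0
    for i in range(8):
        if (x >> i) & 1:
            y ^= cols[i]
    return y
```

Because the mapping is additive, it is fully described by the images of the eight basis elements, which is why step 3 yields an 8 x 8 binary matrix. The matrix for the opposite direction follows by inverting the dictionary before calling `matrix_columns`.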

7 Performance 

The performance figures reported below are for the Rijndael encryption circuit and
software, assuming a key size of 128 bits.

Our core circuit for Rijndael encryption contains less than four thousand 
gates. For the purpose of comparison, we report numbers based upon a circuit 
with 520 I/O pins that uses multiple cores in parallel. 



Table 1. Circuit Performance Figures

                        Transistor/Gate count          Cycles/block   Throughput
Ichikawa^ et al.[6]     518K gates                     7              1.95 Gbps
Weeks et al.[13]        642K transistors               ?              606 Mbps
Elbirt et al.[5]        ?                              6              300 Mbps @ 14 MHz
  (256-pin I/O)         ?                              2.1            1.938 Gbps @ 32 MHz
Our hardware circuit    256K gates, using 32           0.5            7.5 Gbps @ 32 MHz
  (iterated)            parallel cores of 4k gates
                        each and 252 gate levels

Table 2 lists cycle counts and target architectures for various reported imple- 
mentations. In our case, the numbers apply to any architecture that can support 
bitwise AND and EXOR in addition to LOAD and STORE operations. The three 
numbers we report correspond to architectures with effective datapath widths 
(number in parenthesis) of 256 bits, 384 bits and 512 bits respectively (this is 
perhaps the interesting range of architectures today) . The cycle count goes down 
with increasing datapath width. 

It may be mentioned that no minimization or synthesis tools were used for 
our circuit — the only minimization used is in the sense of section 5. The only 
gates in our circuit are XOR, AND and NOT gates. 



^ Here $2^i$ denotes the element whose bit representation contains all 0s except a 1 in
the $i$th place. For example, for n = 4, m = 2, $2^4$ is the element 00010000, i.e., $\alpha$.

^ This circuit performs encryption as well as decryption. 
Table 2. Cycle counts per block for software implementations

Worley et al.[16]         284 (Pentium)  176 (PA-RISC)  124 (IA-64)
                          Requires an 8KB table
Weiss et al.[14]          210 (Alpha 21264)
Wollinger et al.[15]      228 (TMS320C6x)
Aoki et al.[1]            237 (Pentium II)
Our bit-sliced software^  170 (256b)  119 (384b)  100 (512b)
                          Requires only EXOR, AND, L/S, and 2KB table
Appendix: Our Rijndael Circuit

Presented below are drawings of our gate circuit for Rijndael encryption. The
figures appear in order of increasing level of detail, with Figure 1 showing the
high level view of our circuit.
Fig. 1. This figure contains the high level view of the Rijndael encryption circuit. The
transform function consists of 16 parallel circuits for $T^*.x$, where $x \in GF(2^8)$ and $T^*$ is
the matrix to convert elements from $GF(2^8)$ to elements of the composite field as decided
by section 5. Similarly, Inverse-transform consists of 16 parallel circuits for $(T^*)^{-1}.x$.
Circuits for the multiplication of a constant matrix with a vector are obtained from
the method given in [11]

Fig. 2. This figure describes the rijndael-impl block in Figure 1

Fig. 3. This figure shows the composition of each round. Note that in our implementa-
tion, n=10. Shift_row does not require any gate. Add_Round_Key is simply the EXOR
of the corresponding bits of the two inputs

Fig. 4. This figure shows the implementation of the Bytesub operation. Affine has 16
parallel circuits for calculating $T^* A (T^*)^{-1}.x + H(c)$

Fig. 5. mix_column

Fig. 6. This figure describes the linear_comb operation from Figure 5. Add8 is simply
the EXOR of the corresponding bits of the two inputs. All datapaths are 8-bit wide

Fig. 7. This figure shows the circuits for calculating $H(02).x$ and $H(03).x$; Add4 is
simply the EXOR of the corresponding bits of the two inputs. Const_mul14 evaluates
the constant multiplication $\omega^{14}.x$, where $\omega$ is the primitive element of $GF(2^4)$ and
$x \in GF(2^4)$. These circuits have been obtained from [11]. All datapaths are 4-bit wide

Fig. 8. This figure shows the Inverse8 operation in Figure 4. Square4 is from [11];
Mul4 and Inverse4 are from [9]. All datapaths are 4-bit wide
Abstract. This paper describes an algorithm and architecture based
on an extension of a scalable radix-2 architecture proposed in a previous
work. The algorithm is proven to be correct and the hardware design is
discussed in detail. Experimental results are shown to compare a radix-8
implementation with a radix-2 design. The scalable Montgomery multi-
plier is adjustable to constrained areas while still being able to work on any
given precision of the operands. Similar to some systolic implementa-
tions, this design avoids the high load on signals that broadcast to several
components, making the delay independent of the operands' precision.

Key Words: modular multiplier, montgomery multiplier, scalable ar- 
chitecture, high-radix. 



1 Introduction 

Several applications, such as the RSA algorithm [14], the Diffie-Hellman key exchange
algorithm [5], the Digital Signature Standard [12], and elliptic curve cryptography
[6,9], use modular multiplication and modular exponentiation. The Montgomery
Multiplication (MM) algorithm [10] provides certain advantages in the imple- 
mentation of modular multiplication. Multiple software and hardware designs 
have been developed using the algorithm. 

An aspect of cryptographic applications is that very large numbers are used. 
The precision varies from 128 and 256 bits for elliptic curve cryptography to 
1024 and 2048 bits for applications based on exponentiation [15]. Most of the 
hardware designs for modular multiplication are fixed-precision solutions. That 
is, the operands cannot exceed a fixed bit-size. Designs that can take operands 
with an arbitrary precision are researched in the ASIC [18] and the FPGA [2] 
realms. 

It is recognized that designing hardware requires making an area-time trade-
off [21]. In the general case “faster means better”. However, an application where
this rule is not valid can always be found. Therefore, it is important that
designers have several options to choose from.

* This research was supported by rTrust Technologies. 

The reader should note that Oregon State University has filed US and International 
patent applications for inventions described in this paper. 
The basic idea of the scalable Montgomery multiplier has been presented in
[18]. The main features of this multiplier are (1) the ability to work on any given
operand precision at the kernel level, (2) adjustability to any chip area, and (3)
a pipelined organization that reduces the impact on signal loads as a result
of high precision of the operands.

The first feature is unique in comparison to other designs. Long-precision
numbers have been handled with small-precision operations by using
conventional multipliers and a control algorithm that uses these multipliers [7].
The general approach is to reuse a hardware core with a fixed precision, usually
at most 32 or 64 bits. The current publications show conventional multipliers
that do not exceed a precision of 100 bits [16,1]. The control algorithm is usually
complex in this case and the increase in parallelism involves multiple datapaths
and high complexity at the system level. Other solutions that use systolic array
implementation are designed for a fixed precision and the implementation must
be modified if a precision larger than the one originally considered is required.

The second feature comes from the flexibility of the algorithm and hardware 
to be adjusted in both word size and number of processing elements. The more 
hardware is available, the better is the performance of the multiplier. Similar 
adjustment is also possible on algorithms based on conventional multipliers, at 
the cost already presented above. Beyond any doubt, cryptographic algorithms 
will be embedded in almost any application involving exchanging of information. 
Applications, such as smart cards [11] and hand-held devices require hardware 
designs restricted on area and power resources. 

The high load on signals broadcast to several hardware components is an
important factor that slows down high-precision Montgomery multiplier (MM) de-
signs. For this reason, the use of systolic structures has been considered by other
researchers. The organization presented in this paper is not purely systolic, and
has a flavor of serial-parallel implementation of the multiplication algorithm.

In this work we present an evolution of the radix-2 algorithm proposed in
previous papers, which leads us to a higher radix design of the system. This
paper describes the issues involved in this design and the experimental results
to compare with the former radix-2 design.



2 High-Radix Word-Based Montgomery Algorithm 

The notation used throughout this text is shown in Table 1.

Table 1. Notation

• $M$ - modulus for modular multiplication;
• $X$ - multiplier operand for modular multiplication;
• $x_j$ - a single bit of $X$ at position $j$;
• $X_j$ - a single radix-$r$ digit of $X$ at position $j$;
• $Y$ - multiplicand operand for modular multiplication;
• $N$ - number of bits in the operands;
• $r$ - radix ($r = 2^k$);
• $S$ - partial product in the multiplication process;
• $k$ - number of bits per digit in radix $r$;
• $qY_j$ - coefficient that determines a multiple of $Y$ which is added to the partial product
$S$ in the $j$th iteration of the computational loop;
• $qM_j$ - coefficient that determines a multiple of the modulus $M$ which is added to the
partial product $S$ in the $j$th iteration of the computational loop;
• $BPW$ - number of bits in a word of either $Y$, $M$ or $S$;
• $NW$ - number of words in either $Y$, $M$ or $S$;
• $NS$ - number of stages;
• $CS$ - carry-save;
• $C_a$, $C_b$ - carry bits;
• $(Y^{(NW-1)}, \ldots, Y^{(1)}, Y^{(0)})$ - operand $Y$ represented as multiple words;
• $S^{(i)}_{k-1..0}$ - bits $k-1$ to 0 of the $i$th word of $S$.

Figure 1 shows the Multiple-word High-Radix ($2^k$) Montgomery Multiplica-
tion algorithm (MWR$2^k$MM), a generalization of the MM algorithm presented
in [18]. A full-precision High-Radix Montgomery algorithm has been presented
and proven to be correct in [8]. To prove correctness of the algorithm in Figure 1
we show that it is equivalent to the one presented in [8].

The parameter $k$ changes depending on how many bits of the multiplier $X$
are scanned during each loop, or the radix of the computation ($r = 2^k$). Each
loop iteration (computational loop) scans $k$ bits of $X$ (a radix-$r$ digit $X_j$) and
determines the value $qY_j$ according to Booth encoding [3]. Booth encoding is
applied to a bit vector to reduce the complexity of multiple generation in the
hardware. For radix-8 the Booth function for each digit is given as:

$Booth(X_i, x_{i-1}) = -4x_{i+2} + 2x_{i+1} + x_i + x_{i-1}$

where $X_i = (x_{i+2}, x_{i+1}, x_i)$ is a radix-8 digit ($i = km$ where $m$ is an integer),
$x_j \in \{0, 1\}$, and $x_{i-1}$ is the most significant bit (MSbit) of the previous digit.
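
The radix-8 Booth function can be tried out directly. The sketch below is illustrative only: the function name `booth_radix8` and the choice to pad the operand with at least one zero bit above its most significant bit (so that the recoded digits sum back to the unsigned value) are assumptions made here, not details taken from the paper.

```python
def booth_radix8(x, nbits):
    """Recode a non-negative nbits-bit integer into radix-8 Booth digits in [-4, 4].

    Digits are returned least significant first. At least one zero bit above
    the operand is scanned so that the top digit is never negative.
    """
    ndigits = -(-(nbits + 1) // 3)      # ceil((nbits + 1) / 3)
    digits = []
    prev_msb = 0                        # x_{-1} = 0
    for m in range(ndigits):
        i = 3 * m
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        b2 = (x >> (i + 2)) & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + prev_msb)
        prev_msb = b2                   # MSbit of this digit feeds the next one
    return digits


def from_digits(digits):
    """Recombine radix-8 digits: sum of digit * 8^position."""
    return sum(d * 8 ** m for m, d in enumerate(digits))
```

The recoding limits the multiples of Y that the hardware must generate to 0, ±Y, ±2Y, ±3Y and ±4Y, instead of 0 through 7Y.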

For radix-2 computation, $k = 1$ and $qY_j = x_j$ are used, making the algorithm
equivalent to the one presented in [18]. $C_a$ and $C_b$ represent two carry bits that
are propagated from the computation of one word to the computation of the
next word. In order to make the least-significant $k$ bits of $S$ all zeros, $qM_j M$ is
added to the partial product. This is required to avoid losing bits in the shift
operation performed in Step 10. The value of $qM_j$ that satisfies this condition is
determined by examining the least significant $k$ bits of $S$ generated at Step 4.

In Steps 11 and 12 the most significant (MS) word of $S$ is generated and sign
extended. The use of Booth encoding may cause intermediate values of $S$ to be
negative. The final result in $S$, when Step 13 (final reduction step) is reached,
is always positive and it can be a number greater than the modulus $M$. The
purpose of Step 13 is to reduce the result to a number less than the modulus. $M$ is chosen
as $2^{N-1} < M < 2^N$ and the result is bounded as $0 \le S < 2M$. Therefore, a
single subtraction of the modulus, performed only when
the final result in $S$ is greater than or equal to the modulus, will ensure that $S < M$.

The MWR$2^k$MM is a multiple-word version of a full-precision algorithm pre-
sented in Figure 2, which is called in this work the R$2^k$MM algorithm. To obtain
the R$2^k$MM algorithm we transform the word-based sequence of operations into
full-precision operations. It is shown in [8] that the requirement for $qM$ is given
as:

$qM \cdot M \equiv -S \pmod{2^k}$.
Step
 1:  S := 0
     x_{-1} := 0
 2:  FOR j := 0 TO N-1 STEP k
 3:    qY_j = Booth(x_{j+k-1..j-1})
 4:    (Ca, S^(0)) := S^(0) + (qY_j * Y)^(0)
 5:    qM_j := S^(0)_{k-1..0} * (2^k - (M^(0)_{k-1..0})^{-1}) mod 2^k
 6:    (Cb, S^(0)) := S^(0) + (qM_j * M)^(0)
 7:    FOR i := 1 TO NW-1
 8:      (Ca, S^(i)) := Ca + S^(i) + (qY_j * Y)^(i)
 9:      (Cb, S^(i)) := Cb + S^(i) + (qM_j * M)^(i)
10:      S^(i-1) := (S^(i)_{k-1..0}, S^(i-1)_{BPW-1..k})
       END FOR;
11:    Ca := Ca or Cb
12:    S^(NW-1) := sign ext (Ca, S^(NW-1)_{BPW-1..k})
     END FOR;
13:  IF S >= M THEN S := S - M
     END IF;

Fig. 1. Multiple-word High-Radix (Radix-$2^k$) Montgomery Multiplication
(MWR$2^k$MM) Algorithm.



This requirement can also be rewritten as

$S_{k-1..0} + qM \cdot M_{k-1..0} \equiv 0 \pmod{2^k}$.

The latter equation is another representation of the requirement that the last $k$
bits of $S$ must be zeros. Step 5 is equivalent to this requirement, as shown
below:

$qM_j = S_{k-1..0} \cdot (2^k - M^{-1}_{k-1..0}) \bmod 2^k$

$qM_j = S_{k-1..0} \cdot (-M^{-1}_{k-1..0}) \bmod 2^k$

$S_{k-1..0} = S \bmod 2^k$, $M_{k-1..0} = M \bmod 2^k$



Step
1:  S := 0
    x_{-1} := 0
2:  FOR j := 0 TO N-1 STEP k
3:    qY_j = Booth(x_{j+k-1..j-1})
4:    S := S + qY_j * Y
5:    qM_j := S_{k-1..0} * (2^k - (M_{k-1..0})^{-1}) mod 2^k
6:    S := sign ext (S + qM_j * M) / 2^k
    END FOR;
7:  IF S >= M THEN S := S - M
    END IF;

Fig. 2. High-Radix (Radix-$2^k$) Montgomery Multiplication (R$2^k$MM) Algorithm.
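
To make the full-precision algorithm concrete, here is a small software model of it. This is a sketch under assumptions rather than a description of the hardware: it uses Python integers instead of carry-save words, scans enough digits that at least one zero bit above the N-bit operand is covered (so the Booth digits sum to the unsigned X), and returns the digit count so the caller can see which power of two was divided out. The names `booth_digit` and `mont_mul_radix2k` are hypothetical.

```python
def booth_digit(x, j, k):
    """Radix-2^k Booth digit of x starting at bit j, using bit j-1 as x_{j-1}."""
    low = (x >> j) & ((1 << (k - 1)) - 1)        # bits j .. j+k-2
    top = (x >> (j + k - 1)) & 1                 # bit j+k-1 carries weight -2^(k-1)
    prev = (x >> (j - 1)) & 1 if j > 0 else 0    # x_{-1} = 0
    return -(top << (k - 1)) + low + prev


def mont_mul_radix2k(x, y, m, nbits, k=3):
    """Return (X * Y * 2^(-k * ndigits) mod M, ndigits).

    Assumes m is odd, 0 <= x < 2^nbits and 0 <= y < m.
    """
    r = 1 << k
    m_neg_inv = (-pow(m, -1, r)) % r             # -M^(-1) mod 2^k
    ndigits = -(-(nbits + 1) // k)               # ceil((nbits + 1) / k)
    s = 0
    for d in range(ndigits):
        q_y = booth_digit(x, d * k, k)
        s += q_y * y                             # S may become negative here
        q_m = ((s % r) * m_neg_inv) % r          # makes the low k bits of S zero
        s = (s + q_m * m) >> k                   # exact, sign-preserving shift
    if s >= m:                                   # final reduction step
        s -= m
    return s, ndigits
```

For example, `mont_mul_radix2k(200, 150, 239, 8)` scans three radix-8 digits and returns a value congruent to 200 · 150 · 2^(-9) modulo 239. Python's `>>` on negative integers is an arithmetic shift, which plays the role of the sign extension in Step 6.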
$qM_j = S \cdot (-M^{-1}) \bmod 2^k$

$qM_j \cdot M \equiv -S \pmod{2^k}$

It is also easy to show that

$X = \sum_{j=0} (2^k)^j \cdot qY_j$

from the Booth encoding properties.

The last two equations show that the coefficients $qY_j$ and $qM_j$ are determined
the same way as in [8], which makes both algorithms equivalent. In [8] there are
requirements for $X$ and $Y$ that determine the boundaries for the result $S$. There
are no such requirements in the R$2^k$MM algorithm. The R$2^k$MM algorithm
inherits the boundaries for the result from the original MM algorithm.

3 High-Radix Montgomery Multiplier — System Level 

For high-precision computation it is beneficial to divide the multiplicand Y, the 
modulus M and the result S into words [18]. The approach keeps the gates 
and the wire delays inside reasonable boundaries. With operands’ precision of 
thousands of bits, a conventional design to multiply all the bits at once would 
have a high number of pins, increased fan-in for the gates, high gate loads, and 
gate outputs driving long wires. 

The multiplications $(qY \cdot Y)^{(i)}$ and $(qM \cdot M)^{(i)}$ shown in the MWR$2^k$MM
algorithm can be implemented by multiplexers (MUXes) and adders. The shifting
operation in Step 10 is simple in hardware. Additions can be done using Carry-
Save Adders (CSA), keeping $S$ in redundant form. With this approach the
carries generated during addition are not propagated but rather stored in a
separate bit-vector along with a bit-vector for the sum bits. The most complex
operations, finding the coefficients $qY$ and $qM$ (Steps 3 and 5), can be executed
by table look-up. $qY$ is pre-computed before the computational cycle begins since
it depends only on the least significant $k$ bits of $X$. This observation leaves the
computation of $qM$ in the most critical part of the algorithm, as is also pointed
out by other authors [13,20].

The architecture of a Montgomery multiplier implementing the MWR$2^k$MM
algorithm is shown in Fig. 3. There are two main functional blocks: Kernel and
IO. Only the data path is shown. The Kernel's datapath is where the computa-
tion takes place according to the algorithm. A control block (not shown) supplies
the signals to synchronize the system.

The final reduction functional block computes the final result in a suitable 
form for the multiplier’s output, implementing step 13 of the algorithm. More 
details are provided later. 

The Kernel's datapath gets as inputs BPW-bit words of $Y$, $M$ and $S$ (rep-
resented in Carry-Save form as $SS$ and $SC$) and $k$ bits of $X$. The outputs are
BPW-bit words of the new partial product $S$. The superscript star (*) indicates
that the signal is one word of the corresponding vector. For example, $Y^*$ rep-
resents one word of vector $Y$. These signals change every clock cycle. Depending
on the kernel configuration (number of stages and word size) the operands must
pass through the data path several times [18].

Fig. 3. System Level Diagram of Modular Multiplier.

The signal $X_j$ is a $k$-bit signal. It provides the bits of $X$ needed for Step 3
of the MWR$2^k$MM algorithm.

The IO block provides the interface with the user and the memory elements
for the operands, modulus, and partial result. This block can be implemented
in different ways depending on the application where the multiplier will be used
and/or the system's architecture in which the multiplier will be integrated. The
solution for this block can be flexible and the only requirement for it is to meet
the timing specifications for the kernel. Therefore, the architecture of this func-
tional unit is outside the scope of this work. A detailed description of the signal
timing in the interface between IO and kernel is presented in [19].

4 Kernel Datapath and Reduction 

The kernel datapath is organized as a pipeline of cells (MMcell) separated by
registers (Fig. 4). A stage consists of an MMcell and a register. The MMcell im-
plements one iteration of the FOR loop (Steps 3 to 12) in the MWR$2^k$MM
algorithm. Each stage gets as inputs one word of $Y$, $M$, $SS$ and $SC$ each clock
cycle. Additionally, $(NS \cdot k)$ bits of $X$ are transferred to the kernel over $2 \cdot NS$
clock periods, where $NS$ corresponds to the number of stages. Depending on the
computation's progress, $k$ bits of $X$ are loaded in a different stage every 2 clock
cycles. Each stage needs these bits at different times. Thus, this signal is made
common for all stages, with internal control loading the signal in the right stage
at the right time. The MS bit of the previous digit is used to Booth encode the
current digit, as explained in Section 2; thus, a cell must store these two pieces of
information in order to properly encode a radix-$r$ digit of $X$. The datapath outputs
one word each of $SS$ and $SC$ every clock cycle. The pipeline outputs are
$SS^{*}_{OUT}$ and $SC^{*}_{OUT}$.

Fig. 4. Top Level Diagram of the Kernel datapath.



Each MMcell propagates the words of $Y$ and $M$ and the newly computed
words of $SS$ and $SC$ to the next MMcell, which performs another computational
loop of the Montgomery Multiplication algorithm and in turn propagates the
words of $Y$ and $M$ and the newly computed words of $SS$ and $SC$, with a latency
of 2 cycles.

The reduction block implements the final reduction step in the MWR$2^k$MM
algorithm. The final reduction happens after the last iteration of the loop scan-
ning the bits of $X$. During the intermediate iterations the final reduction block
propagates the signals from the kernel datapath without operating on them.
However, the design takes advantage of the word-serial output of the kernel dat-
apath and implements the final reduction serially, on-the-fly, as the words of
both vectors of the result are coming out of the kernel datapath. The condition
$S \ge M$ will not be known before the last pair of words for $S$ is computed in
the datapath. The final reduction block implements the computation for both
conditions: $S \ge M$, when $S - M$ is generated, and $S < M$, when the result is
already correct. In both cases the Carry-Save to non-redundant conversion is required.
Both resulting vectors will be stored in the place for $SS$ and $SC$ (the two bit-
vectors of the intermediate result) in the IO block. After the last pair of words
of $S$ is processed, a flag is set by the control circuitry indicating which condition
is valid, $S \ge M$ or $S < M$. The result will be in either $SS$ or $SC$. A detailed
implementation of the final reduction block is presented in [19].



5 Kernel Implementation 

The direct design of the kernel processing element leads to the organization shown
in Figure 5(a). The figure shows the main blocks in the design: Booth encoding,
multiple generation, adders, and registers (shaded boxes). Shifting and alignment
is done by proper combination of signals.

A.F. Tenca, G. Todorov, and Ç.K. Koç

Fig. 5. Kernel cell organization: (a) first try, and (b) after re-timing.

The cell operates on $k + 1$ bits of the multiplier $X$ (one bit is obtained from the
previous scan) and one word each of the multiplicand ($Y$), the modulus ($M$) and
the partial product ($S$). Booth encoding is generated by a lookup table to find the
coefficient $qY_j$. The negative multiples of $Y$ are implemented by complementing
their positive counterparts and adding a '1' (two's complement sign change). The
coefficient $qM_j$ depends on the last $k$ bits of the partial product $S$ and the last
$k-1$ bits of the modulus $M$ (Step 5). Recall that $M$ is odd. Before $S$ is shifted to
the right, the value $qM_j M$ is added to $S$ (Steps 6 and 9). The coefficient $qM_j$ is in
the range $[0, 2^k - 1]$. For radix-8, the greatest value happens when $S^{(0)}_{2..0}$ = “001”
and $M^{(0)}_{2..0}$ = “001” ($qM_j = 7$). The lowest value happens when $S^{(0)}_{2..0}$ = “000” and
$M^{(0)}_{2..0}$ = “001” ($qM_j = 0$).
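
The look-up table for $qM_j$ is small enough to enumerate. The sketch below builds it for a given radix, assuming only the relation $qM_j = S_{k-1..0} \cdot (-M^{-1}) \bmod 2^k$ from Section 2; the function name `qm_table` and the dictionary layout are illustrative choices, not the paper's table format.

```python
def qm_table(k=3):
    """Map (low k bits of S, low k bits of odd M) to the coefficient qM."""
    r = 1 << k
    table = {}
    for m_low in range(1, r, 2):                 # M is odd, so its LSB is 1
        neg_inv = (-pow(m_low, -1, r)) % r       # -M^(-1) mod 2^k
        for s_low in range(r):
            table[(s_low, m_low)] = (s_low * neg_inv) % r
    return table
```

For radix-8 the table has 8 x 4 = 32 entries; `qm_table()[(1, 1)]` is 7 and `qm_table()[(0, 1)]` is 0, matching the extreme cases quoted above. Since the LSB of M is fixed at 1, only $k-1$ modulus bits need to be fed to the table.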

Multiple generation for high-radix designs is expensive because $qY$ and $qM$
may assume values that are not powers of 2. As an example, the bit-vector $2Y$
can be produced from $Y$ by left-shifting $Y$ by one bit. However, the bit-vector
$3Y$ is produced by adding $Y$ and $2Y$.

The critical path in the basic design is very long and makes the design of
such a high-radix circuit less attractive. The high radix is going to increase the
table delay and size, and the multiple generation delay and size. To increase the 
performance of this system, re-timing was applied, resulting in the design shown 
in Figure 5(b). 



5.1 Improving the Performance Using Re-timing 

Using re-timing, pieces of combinational logic are relocated to other parts
of a sequential system, modifying the critical path. One problem with the first
direct implementation of the high-radix algorithm is the long critical path, pass-
ing through several modules, as shown in Figure 5(a). One can observe that the
determination of $qM_j$ depends on the $k$ LSBits of the partial product from the previ-
ous computational cycle, the $k$ LSBits of $Y^{(0)}$, and the coefficients $qY_j$.
If the word size for $S$ is more than $2k$ bits, the $k$ LSBits of $S$ for the next pipeline
stage will be available well before the whole word $S^{(0)}$ is available. The idea is
to advance the information on the $k$ least-significant bits (LSBits) of the shifted
partial product. In the previous design, these bits were propagated between two registers
with no logic operation done on them. Instead of simply propagating the bits,
the logic determining $qM$ is performed on them, as shown in Figure 5(b).

The difference between these cell designs is that a portion of the first adder
was moved to before the input registers, and this portion of the adder computes
only the $k$ LSBits of the not yet shifted partial product, which is required to
compute $qM$. The $k$-bit vector addM in the figure represents these bits in non-
redundant form, and is applied to the Table that generates $qM$ in the next clock
cycle, considering also $k-1$ bits of the modulus $M$. As a result of this hardware
organization, all possible path delays will not exceed the delay of two adders and
two MUXes.

The computation done on the LSBits by the leftmost adder is also done for all the
other remaining operand words. So, while the leftmost adder works on the LS
bits of a word, the topmost adder (after the input register) should be working 
on the other bits of the same word. There is one clock cycle difference between 
the two circuits, and therefore, this situation must be considered carefully. 

5.2 A Radix-8 Design 

Without loss of generality, the details of this design will be explained based on 
a radix-8 implementation. The circuit in Fig. 6 shows the diagram for a Radix-8 
MMcell. 

One way of implementing the coefficients $qY$ and $qM$ is to split them into
some components that will generate simple multiples and add these multiples in
the adder. For $r = 8$, two values could be used. For example, $qY = 3$ would be
split into 2 and 1, and the $3 \cdot Y$ multiple would be generated as $2 \cdot Y + 1 \cdot Y$
or $4 \cdot Y - 1 \cdot Y$, without actually performing the addition or subtraction but
using two bit-vectors, $2 \cdot Y$ and $1 \cdot Y$ or $4 \cdot Y$ and $-1 \cdot Y$ in this example. It is
efficient to choose only one of the components as a negative value. This is true
because negative bit-vectors, like $-Y$, are implemented by inverting the positive
bit-vector, $Y$ in this case, and introducing a carry-in with a value of 1. Since
each four-to-two adder has only one carry-in input, only one of the components
can be negative.
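
The identity behind the negative component can be checked on fixed-width words. The following sketch models a single word of a hypothetical width and is not the multiplexer structure of the MMcell; the helper names are illustrative. It forms $3Y$ as $4Y + \overline{Y} + 1$, where the final 1 is the carry-in mentioned above.

```python
def negated_component(y, width):
    """Return (bit-complement of y, carry-in), which together represent -y."""
    mask = (1 << width) - 1
    return (~y) & mask, 1


def three_y(y, width):
    """Compute 3*y modulo 2^width as 4*y + (~y) + carry-in."""
    mask = (1 << width) - 1
    inverted, carry_in = negated_component(y, width)
    return ((y << 2) + inverted + carry_in) & mask
```

Both summands, $4Y$ (a left shift) and $\overline{Y}$ (an inversion), are cheap to generate, and the adder's single carry-in supplies the "+1".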

Two multiplexers generate the multiples $(q1Y_j \cdot Y)^{(i)}$ and $(q2Y_j \cdot Y)^{(i)}$. The
Booth encoding is done according to Table 2 in the DEC-Xj functional block. As
an example, $(\backslash 2 \cdot Y)$ means that $Y$ is multiplied by 2 and all the bits are
complemented (or negated). Also, one can notice that the values 2 and $-2$ are
formed in two different ways. This approach simplifies the decoding logic for $X_j$.
The outputs of DEC-Xj are the control signals for the multiplexers as well as
the carry-in bit for the first 4-to-2 adder (during the first computational cycle
only).
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Fig. 6. Radix-8 MM Cell. 

Because the coefficients qy and qm are split into two each, the adders need 
to have an extra input. The two four-to-two adders have a total of two carry- 
out bits propagating between sequential words of the partial product S. One 
carry-out is inserted at the LSB position of vector carryA. The other carry-out 
is introduced back to the same adder as a carry-in bit for the next word of S. 

The coefficient $q_{M_j}$ depends on the three LSBits of the partial product S and
the three LSBits of the modulus M. The partial product is represented by two vectors.
There is one additional input bit, hidden-bit, which affects $q_{M_j}$. The hidden-bit is
generated by carry propagation in the least significant bits of the least significant
word computation, which are zeroed in the process. Knowing that the LSB of
M is always '1' and the LSB of carryA is always '0', $q_{M_j}$ will depend only on
eight bits: $sumA_{2..0}$, $carryA_{2..1}$, hidden-bit and $M^{(0)}_{2..1}$.

In Step 10 of the MWR8MM algorithm the partial product is right-shifted by
three bits. Because carry-save representation (CS) is used for S, the LS words of
the two bit-vectors ($sumB^{(0)}$, $carryB^{(0)}$) after Step 6 in the algorithm can be, for
example: $sumB^{(0)}$ = x..x110 and $carryB^{(0)}$ = x..x010, where x represents
any value of the bit in this position. The last three bits of S are equivalent
to zeros when converted to a non-redundant form. However, data will be lost
if these bits are shifted out without taking into account the carry propagation
(110 + 010 = 1000). The carry bit generated in this case is the hidden-bit.

Table 2. Booth encoding for $q_Y$; the backslash means bit-complement.

Instead of using a carry-propagate adder to obtain the hidden-bit, in radix
8 the following observation is made: the last bit of $carryB^{(0)}$ is always '0';
therefore, to detect a hidden-bit it is enough to test whether there is a 1 in the
second or third bit of either $carryB^{(0)}$ or $sumB^{(0)}$. The circuit for the
hidden-bit detection is reduced to $sumB^{(0)}_1 + sumB^{(0)}_2$ (a logical OR). These two
bits of $sumB^{(0)}$ are stored in flip-flops; thus, the hidden-bit logic does not lie
on the critical path of the whole cell. Since the hidden-bit is found after the
operation on the LS word is done, it is transferred from one cell to another as
part of the LS word. It can be inserted in the free LSBit position of the carry
vector and also participates in determining $q_M$.
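
The hidden-bit shortcut can be checked exhaustively with a small Python sketch. It compares the carry out of a three-bit addition with the OR of bits 1 and 2 of the sum vector, over every low-bit pattern in which the carry vector has LSB '0' and the three low bits of S are zero in non-redundant form. The function names are illustrative assumptions, not identifiers from the design.

```python
def hidden_bit_exact(sum_b0: int, carry_b0: int) -> int:
    """Carry out of the three LSBs, as a small carry-propagate adder would give."""
    return ((sum_b0 & 0b111) + (carry_b0 & 0b111)) >> 3


def hidden_bit_fast(sum_b0: int) -> int:
    """OR of bit 1 and bit 2 of the sum vector only."""
    return ((sum_b0 >> 1) | (sum_b0 >> 2)) & 1


def valid_low_bits():
    """Yield (sum, carry) 3-bit patterns: carry LSB is 0 and sum + carry = 0 mod 8."""
    for s in range(8):
        for c in range(0, 8, 2):
            if (s + c) % 8 == 0:
                yield s, c


if __name__ == "__main__":
    for s, c in valid_low_bits():
        print(f"{s:03b} + {c:03b}: exact={hidden_bit_exact(s, c)} fast={hidden_bit_fast(s)}")
```

Under these two conditions the sum and carry low bits are either both zero or add up to exactly 8, which is why inspecting the sum vector alone is sufficient.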

If all eight bits are used for a lookup table for $q_M$, the table will have 256
entries. The number of entries can be reduced by assimilating the carries for
$sumA_{2..0}$, $carryA_{2..1}$, and hidden-bit with a three-bit adder. The resulting
three-bit vector is named addM:

$$addM_{2..0} = \left( sumA_{2..0} + (carryA_{2..1}, 0) + (00, hidden\text{-}bit) \right) \bmod 8,$$

which reduces the table for $q_M$ to only 32 entries. It is represented by the DEC_M
functional block according to Table 3. The decoder outputs are the control signals
for the multiplexers implementing $(q1_{M_j} * M)^{(i)}$ and $(q2_{M_j} * M)^{(i)}$. The decoder
also has an output which is asserted '1' whenever $q1_{M_j}$ is negative. This signal
becomes a carry-in for the second four-to-two adder.
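
The reduction from 256 to 32 table entries can be illustrated with a short Python sketch. Table 3 is not reproduced in this text, so the sketch does not use its encoding; it assumes the usual Montgomery condition that $q_M$ is the digit in 0..7 making the three LSBits of S + $q_M$ * M zero. The function names and the unsigned digit set are assumptions for illustration.

```python
def add_m(sum_a: int, carry_a: int, hidden_bit: int) -> int:
    """addM[2..0] = (sumA[2..0] + (carryA[2..1], 0) + (00, hidden-bit)) mod 8."""
    return ((sum_a & 0b111) + (carry_a & 0b110) + (hidden_bit & 1)) % 8


def q_m(add_m_value: int, m_low: int) -> int:
    """Assumed digit in 0..7 with (addM + q_M * M) = 0 mod 8; m_low = M[2..0], odd."""
    m_inv = pow(m_low, -1, 8)  # modular inverse, Python 3.8+
    return (-add_m_value * m_inv) % 8


def build_qm_table() -> dict:
    """Table indexed by (addM[2..0], M[2..1]): 8 * 4 = 32 entries."""
    return {(a, m21): q_m(a, (m21 << 1) | 1) for a in range(8) for m21 in range(4)}


if __name__ == "__main__":
    print(len(build_qm_table()))  # number of entries after carry assimilation
```

The three-bit adder collapses the six bits of sumA, carryA and hidden-bit into three, so only five bits index the table.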

The multiples of Y and M, like 2Y, 4Y, 2M, 4M, 8M, require that these
operands be left-shifted. Because of the word-serial scanning of this algorithm,
this shifting requires some of the MSBits from the previous words of Y and M
to be kept when the new words arrive. If it is the first word (first-cycle = '1') then
a number of zeros is shifted in to produce the needed multiple. Otherwise, the
MSBits of the previous word are shifted in as the LSBits of the current word.

As described at the end of the previous section, the leftmost adder is operating
on the LSbits of word j of S and $q_Y Y$ while the topmost adder is operating
on the MSbits of word j − 1. This arrangement requires that the carry-out
propagation among words of the partial sum A (carryA and sumA) be considered
carefully. The carry-out of the topmost adder, net spillA2, is introduced
immediately as carry-in for the leftmost adder. The carry-out of the leftmost adder is
delayed one clock cycle before it is introduced as carry-in to the topmost adder.

Table 3. Decoding for $q_M$.

6 Experimental Results and Analysis

This section describes the experimental data obtained with the radix-8 kernel
designs and compares them with the radix-2 design. Although both radix-8
designs were implemented, only the results for the re-timed radix-8 design are
presented in detail. The complete data is presented in [19].

6.1 Synthesis and Simulation Environment 

The Mentor Graphics’ package of applications was used to generate this data. 
The target technology was set to AMI05_slow (0.5 μm) provided in the ASIC
Design Kit (ADK) from the same company. A data-book for this technology 
is available at [4]. Before the designs were synthesized, they were simulated in 
ModelSim for functional correctness. The designs were described in VHDL, syn- 
thesized with Leonardo as flattened designs (no hierarchy), and laid-out using 
ICStation. This last tool provides RC parameter extraction. RC-extraction al- 
lows the determination of time delay values for each wire in the design, bringing 
further simulations closer to the real-silicon simulations. Using the information 
from ICStation and Leonardo, the designs were back annotated and verified 
with Velocity. The values presented in this section were obtained from several 
experiments. 

The kernel area depends on the number of stages in the pipeline (NS) and 
the word size (BPW). The area for the radix-8 kernel was obtained as: 

$$A_{kernel\text{-}r8} = 92 \cdot BPW \cdot NS + 269 \cdot NS - 9.42 \cdot BPW - 35.5.$$
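
The area equation is easy to evaluate directly. The following Python sketch (function name assumed) computes the kernel area in gates and can be compared against the gate counts quoted for BPW = 8 later in this section.

```python
def kernel_area_radix8(bpw: int, ns: int) -> float:
    """Area of the radix-8 kernel in gates, from the fitted equation in the text."""
    return 92 * bpw * ns + 269 * ns - 9.42 * bpw - 35.5


if __name__ == "__main__":
    for ns in (15, 16, 18, 22, 24, 26):
        print(ns, round(kernel_area_radix8(8, ns)))
```

For BPW = 8 each additional stage contributes 92 * 8 + 269 = 1005 gates, which matches the per-stage increment stated in the text.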

The total computational time for the kernel is a product of the number of 
clock cycles (Tclks) and the clock period (tp). The clock period is derived from 
the synthesis results, and will depend on the number of stages, the word size, 
and other parameters. The number of clock periods to complete a computation 
is obtained from the algorithm. 

Table 4 shows the critical path delay (tp) as a function of the number of
stages for the re-timed radix-8 kernel as well as the number of bits per word in
the operands. These two parameters also determine the design area. The
bold-faced figures in the table show tested configurations. The rest of the figures are
produced by linear interpolation. An increase in area leads to an increase in the
critical path delay. This is due to increased wire lengths (parasitic resistance and
capacitance) and fan-outs for the gates. A setup time plus clock-to-Q propagation
time of 1.2 ns for flip-flops is given for the AMI05_slow technology. The hold time
requirement is negligibly small. The setup and hold time requirements will
scale with the technology, giving the same proportional effect on the clock period.



Table 4. Critical path delay for radix-8 Kernel (nsec). 
Two cases should be considered: (1) when NW ≤ 2 * NS, and (2) when
NW > 2 * NS. The variable NW represents the number of words
in the N-bit operands with chosen word size of BPW bits [18]. Because of
the extra register in the pipeline, a word propagates through the pipeline for
(2 * NS + 1) clock cycles. For radix-8, since 3 bits of X are used in each stage,
$\lceil N/(3 \cdot NS) \rceil$ pipeline cycles are required. Equation 1 represents the total
number of clock cycles needed for the re-timed radix-8 Montgomery multiplication
design as:

$$T_{clks} = \begin{cases} \left\lceil \dfrac{N}{3 \cdot NS} \right\rceil (2 \cdot NS + 1) + NW + 1, & \text{if } NW \le 2 \cdot NS \\[2ex] \left\lceil \dfrac{N}{3 \cdot NS} \right\rceil (NW + 1) + 2 \cdot NS, & \text{if } NW > 2 \cdot NS \end{cases} \qquad (1)$$

It can be shown that when NW ≤ 2 * NS, adding more stages to the pipeline
has a somewhat unpredictable effect on the total number of clock cycles. This
happens because in this case the number of words NW has a small effect on the
computational time, while the term containing the fraction $\lceil N/(3 \cdot NS) \rceil$
goes through minima and maxima as the number of stages NS changes. Thus, it may
be the case that a design with more stages will be slower than a design with
fewer stages.

Figure 7 shows the total computational time (Tclks × tp) for N = 256
and N = 1024, using designs with different numbers of stages (NS) and word
sizes (BPW). The first observable minimum computational time happens when
the boundary between NW ≤ 2 * NS and NW > 2 * NS is crossed. With further increase
in the number of pipeline stages the computational time goes through a series of
minimal and maximal values. The boundary NW > 2 * NS is crossed at a different
number of stages for a different precision of the operands (a different number
of words). Operands with precision 256 bits will require a smaller number of
stages in the pipeline than operands with 1024 bits precision, in order to execute
the operation in minimal time. The goal of choosing a design point is to have
computational time for 256-bit precision close to its absolute minimal value and
at the same time to have as small computational time for 1024-bit precision as
possible.
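
The non-monotonic behaviour of the cycle count can be explored with a short Python sketch of Equation 1. It reads the two cases as NW ≤ 2 * NS and NW > 2 * NS, and it takes NW as an explicit argument; the helper `words_per_operand`, which assumes NW = ⌈N/BPW⌉, is a labelled assumption because the exact expression is not legible in this text.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def words_per_operand(n_bits: int, bpw: int) -> int:
    # Assumption for illustration: NW = ceil(N / BPW).
    return ceil_div(n_bits, bpw)


def total_clock_cycles(n_bits: int, ns: int, nw: int) -> int:
    """Tclks for the re-timed radix-8 kernel, following Equation 1."""
    pipeline_cycles = ceil_div(n_bits, 3 * ns)  # 3 bits of X consumed per stage
    if nw <= 2 * ns:
        return pipeline_cycles * (2 * ns + 1) + nw + 1
    return pipeline_cycles * (nw + 1) + 2 * ns


if __name__ == "__main__":
    nw = words_per_operand(256, 8)
    for ns in range(14, 27):
        print(ns, total_clock_cycles(256, ns, nw))
```

Under these assumptions the count for N = 256 and BPW = 8 does not fall steadily with NS: a kernel with 16 stages needs more cycles than one with 15, which illustrates why a design with more stages can be slower. The actual time also depends on the clock period tp of each configuration.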





Fig. 7. Total time for 256-bit (a) and 1024-bit (b) operands for some values of NS and
BPW.



It can be seen from the data obtained in the experiments that the fastest 
designs are achieved with a word size of 8 bits. For this word size and 256-bit 
precision, the first optimal design point is for NS = 15. The area is 14964 NOR 
gates. Each additional stage adds about 1005 to the gate count as can be obtained 
from the area equation. Other optimal points for this design, represented as 
NS/area pairs, are: 16/15969, 18/17979, 22/21999, 24/24009 and 26/26019. 

For 1024 bits of precision, the time decreases asymptotically, with a faster
decrease for a smaller number of stages.



Table 5. Some design points for radix-8 kernel, BPW = 8, N = 256 and N = 1024.

NS              15     16     18     22     24     26
Area (gates)    14964  15969  17979  21999  24009  26019

Table 5 compares several design points for the radix-8 kernel with BPW = 8.
The table presents the design area and the ratio of the computational time
relative to the point NS = 15. It can be seen that the design point with NS = 22
is very suitable since the computational time for 256-bit precision is very close
to its minimal value. At the same time the computational time for 1024-bit
precision is improved by 37% as compared to the point with NS = 15. With a
further increase in the number of stages the computational time for 256-bit
precision worsens while the computational time for 1024-bit precision does not
improve significantly (only 2% per stage).

A comparison of performance between the radix-2 design ([18]) and the radix- 
8 designs discussed in this paper is shown in Figure 8. The data shows the time 
to compute the modular multiplication for 256-bit operands as a function of 
the design area. For small areas, the radix-2 design (v1) performs as well as
the radix-8 design with re-timing (v3). The basic design (v2) is worse than the 
radix-2 one. For areas of 10,000 gates or more, the radix-8 design with re-timing 
is better than the other two, which shows that the high-radix design has a better 
overall performance. 




Fig. 8. Area × time comparison between radix-2 (v1), radix-8 basic (v2), and radix-8
with re-timing (v3) for 256-bit operands. 



7 Conclusion 

This paper presented the algorithm modifications and hardware implementation 
details of a high-radix implementation of the scalable modular multiplier pre- 
sented in [18]. A radix-8 design was used to exemplify the design process, and 
to obtain experimental results that show the viability of using this approach. 
Experimental data shows that the radix-8 scalable multiplier is able to perform 
as well as the radix-2 design for small areas, and better than the radix-2 design
for larger areas. The re-timing technique applied to the high-radix design was 
critical to obtain a competitive solution. 

Abstract. The performance of elliptic curve cryptosystems is primarily
determined by an efficient implementation of the arithmetic operations
in the underlying finite field. This paper presents a hardware architecture
for a unified multiplier which operates in two types of finite fields:
GF(p) and GF(2^m). In both cases, the multiplication of field elements is
performed by accumulation of partial-products to an intermediate result
according to an MSB-first shift-and-add method. The reduction modulo
the prime p (or the irreducible polynomial p(t), respectively) is interleaved
with the addition steps by repeated subtractions of 2p and/or p
(or p(t), respectively). A bit-serial multiplier executes a multiplication
in GF(p) in approximately $1.5 \cdot \lceil \log_2(p) \rceil$ clock cycles, and the
multiplication in GF(2^m) takes exactly m clock cycles. The unified multiplier
requires only slightly more area than that of the multiplier for prime
fields GF(p). Moreover, it is shown that the proposed architecture is
highly regular and simple to design.

Keywords: Elliptic curve cryptography, finite field arithmetic, iterative 
modulo multiplication, polynomial basis representation, bit-serial multi- 
plier architecture, smart card crypto-coprocessor. 



1 Introduction 

In the mid-eighties, N. Koblitz [9] and V. S. Miller [16] independently proposed 
using the group of points on an elliptic curve (EC) over a finite field in discrete 
logarithm cryptosystems. Elliptic curve cryptography can be used to provide 
digital signature schemes, encryption schemes, and key agreement schemes [10]. 
The primary advantage of elliptic curve systems over systems based on the mul- 
tiplicative group of a finite field is the absence of a subexponential-time algo- 
rithm that could solve the discrete logarithm problem (DLP) in these groups [3]. 
Consequently, an elliptic curve group that is smaller in size can be used, while 
maintaining the same level of security [13]. The result is smaller key sizes, band- 
width savings, and faster implementations. These features make elliptic curve 
cryptosystems especially attractive for applications in environments where com- 
putational power is limited, such as smart cards or hand-held devices. 

The performance of an elliptic curve cryptosystem is primarily determined
by the efficient realization of the arithmetic operations (addition, multiplication,
and inversion) in the underlying finite field. Many practical implementations use
projective coordinates [15] to represent points on the elliptic curve because they
allow a point addition/doubling to be performed without inversion. Therefore,
coprocessors for elliptic curve cryptography are most frequently designed to accelerate
the field multiplication.

1.1 Motivation for a Unified Multiplier Architecture 

An elliptic curve can be defined over various mathematical structures such as
a ring or field. In cryptography only finite fields are used because they allow
the field elements to be stored and handled in a manageable way. Due to
standardization activities, two special types of finite fields have become very important
for the implementation of elliptic curve cryptosystems: the prime field GF(p)
and the binary extension field GF(2^m). Various accredited standards bodies like
the National Institute of Standards and Technology (NIST) recommend using
either GF(p) or GF(2^m) as the underlying finite field [19]. In order to promote
interoperability between different implementations and to facilitate widespread
use of well-accepted techniques, a crypto-coprocessor should operate in both
types of finite fields. Therefore, it is an obvious idea to develop a unified
multiplier architecture which can perform multiplications in GF(p) and GF(2^m).
At first glance, prime fields and binary extension fields seem to have dissimilar
properties. However, the elements of either field can be represented using a
bit-string. Furthermore, the arithmetic operations in both fields have structural
similarities allowing a unified design. For example, a multiplication in GF(p) is
performed modulo a prime p, and the multiplication in GF(2^m) is done modulo
an irreducible polynomial p(t) if polynomial basis representation is used.

1.2 Previous Work 

In August 2000, E. Savaş et al. introduced a unified multiplier which operates
in both types of finite fields, GF(p) and GF(2^m) [23]. From an algorithmic point
of view, the multiplication in GF(p) is performed according to Montgomery's
method [17]. The introduction of the Montgomery multiplication for the field
GF(2^m) in [11] opened up the possibility for them to develop a unified multiplier
architecture by taking advantage of the fact that the Montgomery multiplication
is essentially the same operation in both fields. Their implementation utilizes
inherent concurrency in Montgomery multiplication and uses an array of
word-size processing units organized in a pipeline. Savaş' architecture is highly scalable
because a fixed-area multiplier can handle operands of any size. Moreover, the
word-size of a processing unit as well as the number of pipeline stages can be
selected according to the desired area/performance trade-off.

Another interesting VLSI implementation was reported by J. Goodman et
al. [6]. Their so-called Domain Specific Reconfigurable Cryptographic Processor
(DSRCP) provides a full suite of arithmetic operations (including inversion) over
the integers modulo p, binary extension fields, and non-supersingular elliptic
curves over GF(2^m), with operands ranging in size from 8 to 1024 bits. These
operations are implemented using a single computation unit whose datapath cells
can be reconfigured on the fly. The modulo multiplication is realized according
to an iterated radix-2 version of Montgomery multiplication. On the other hand,
the multiplication in GF(2^m) is based on an iterated MSB-first approach.

1.3 Our Contribution 

We introduce a multiplier architecture for unified (dual-field) arithmetic. The 
modulo multiplication proceeds in a serial-parallel fashion according to an it- 
erative approach, which means that the modulo reduction is performed during 
multiplication through concurrent reduction of the intermediate result. 

The main contribution of this paper is a modification of the classical
MSB-first version for iterative modulo multiplication that allows a very efficient
hardware implementation. Additionally, we propose a bit-serial architecture using
carry-save adders for the accumulation of partial-products to an intermediate
result given in a redundant representation. The modulo reduction operation is
interleaved with the partial-product additions by repeated subtractions of once
or twice the modulus. The circuit to decide the multiple of the modulus to be
subtracted is very simple and requires only the two highest order bits of the
redundant intermediate result as inputs. Contrary to other designs, the subtrahend
evaluation circuit of our multiplier does not cause a significant critical path.

We will show that the bit-serial multiplier can also perform multiplications
in GF(2^m) by simply setting all carry-bits of the intermediate result to 0. The
area-cost of the unified multiplier is only slightly higher than that of the
multiplier for the field GF(p), providing significant area savings when both types
of multiplier are needed. To the best of our knowledge, an MSB-first bit-serial
architecture for multiplication in GF(p) and GF(2^m) has never been published
before. Compared to the Montgomery multiplication used in Savaş' implementation,
the MSB-first iterative algorithm requires neither a transformation of
operands into Montgomery domain nor precomputed constants. The bit-serial
architecture has a linear array structure with a bit-slice feature. A high degree of
regularity and mainly local connections make the multiplier simple to design.

1.4 Paper Outline 

The remainder of this paper is organized as follows: Section 2 provides some
background information on MSB-first techniques for radix-2 multiplication with
interleaved reduction. Section 3 presents a modified version of the classical
“shift-and-add” algorithm for modulo multiplication. The modified algorithm uses a
redundant representation of the intermediate result and profits from a novel
quotient estimation technique which is detailed in subsection 3.1. Section 4 covers
arithmetic in binary extension fields GF(2^m) using a polynomial basis
representation. The unified multiplier architecture for GF(p) and GF(2^m) is introduced
in section 5. This section also describes the execution of a multiplication and
presents an estimation of the computation time for both types of finite fields.
The paper finishes with a summary of results and conclusions in section 6.



2 Preliminaries 

The finite field GF(p), also denoted as prime field of order p, is the field of residue
classes modulo p, where the field elements are the integers 0, 1, ..., p − 1. The
field operations are modulo operations, i.e., addition and multiplication modulo
the prime p. Besides the popular Montgomery multiplication [17] and the Barrett
modulo reduction method [1], binary and higher-radix algorithms for MSB-first
iterative modulo multiplication have also been proposed.



2.1 MSB-First Iterative Modulo Multiplication 

A common way of multiplying two integers A and B is to scan the multiplier
B one bit at a time, beginning with the most significant bit (MSB), and to
accumulate the partial-product A·B[i] to the intermediate result. The product
P is a 2n-bit integer if the operands are n bits long and can be written as

$$P = A \cdot B = A \cdot \left( \sum_{i=0}^{n-1} B[i] \cdot 2^i \right) = \sum_{i=0}^{n-1} (A \cdot B[i]) \cdot 2^i \qquad (1)$$

The notation X[i] indicates the i-th bit of an n-bit integer X; X[0] is the LSB,
and X[n−1] is the MSB. After each addition of a partial-product, the intermediate
result must be multiplied by 2 to align it to the next partial-product.
Since a multiplication by 2 is a 1-bit left-shift in hardware, the described method 
is also known as shift-and-add multiplication. The shift-and-add multiplication 
typically results in a bit-serial architecture when implemented in hardware. Bit- 
serial multipliers offer a fair area/performance trade-off, which is an important 
aspect in the design of coprocessors for area-restricted devices like smart cards. 



Input:  An n-bit modulus M (i.e., 2^(n−1) ≤ M < 2^n), a
        multiplicand A < M, and a multiplier B < M.
Output: Result R = A·B mod M.

1  R ← 0
2  for i from n − 1 downto 0 do
3      R ← 2·R + A·B[i]
4      q ← ⌊R/M⌋
5      R ← R − q·M
6  endfor



Fig. 1: MSB-first shift-and-add multiplication with interleaved modulo reduction. 
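
The algorithm of Fig. 1 can be sketched in Python as follows. The sketch uses arbitrary-precision integers and an exact division for the quotient, so it models the arithmetic only and not the hardware cost of the comparison; the function name is an assumption for illustration.

```python
def msb_first_modmul(a: int, b: int, m: int) -> int:
    """MSB-first shift-and-add multiplication with interleaved reduction (Fig. 1)."""
    n = m.bit_length()                      # M is an n-bit modulus
    assert 0 <= a < m and 0 <= b < m
    r = 0
    for i in range(n - 1, -1, -1):          # scan B from MSB to LSB
        r = 2 * r + a * ((b >> i) & 1)      # shift, then add the partial-product
        q = r // m                          # quotient of R and M
        assert q in (0, 1, 2)               # at most two subtractions are needed
        r -= q * m
    return r


if __name__ == "__main__":
    print(msb_first_modmul(123, 200, 251), (123 * 200) % 251)
```

The internal assertion documents the bound on q that the following paragraph derives: since R < M before the shift and A < M, the value 2·R + A·B[i] stays below 3·M.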





J. Großschädl



Figure 1 shows that the simple shift-and-add multiplication can be easily
extended to perform a modulo multiplication. The modulo reduction of the
intermediate result R is interleaved with the addition steps and realized by
subtraction of the product q·M, whereby q is the quotient of R and the modulus M.
The quotient q can be at most 2 since the term 2·R + A·B[i] is always smaller
than three times the modulus M (on condition that A < M):

$$q = \left\lfloor \frac{R}{M} \right\rfloor \quad \text{with } q \in \{0, 1, 2\} \qquad (2)$$



Therefore, the reduction of the intermediate result can be accomplished by sub- 
traction of M or 2-M (i.e., addition of the two’s complement of M or 2-M). 
However, two serious problems arise when implementing this algorithm: 

1. Addition of long integers can cause a significant delay due to carry propa- 
gation from LSB to MSB, which limits the clock frequency. 

2. The exact comparison of the intermediate result R to the modulus M in
order to decide whether the quotient q is 0, 1 or 2 is also difficult to perform
for very long integers.

Various papers on the efficient implementation of MSB-first modulo multi- 
plication can be found in literature. An algorithm published by G. R. Blakley 
realizes the reduction of the intermediate result by one or two subtractions of 
the modulus [4]. E. F. Brickell presented an architecture which performs a
multiplication of two integers modulo p in $\lceil \log_2(p) \rceil + 7$ clock cycles [5]. This approach
uses delayed carry adders to avoid the carry propagation delay, but has problems 
due to the difficulty of comparing long integers and conversion of the result from 
delayed carry representation to binary representation. C. D. Walter proposed 
another technique for speeding up modulo multiplication by scaling the modu- 
lus [26] . The modulus is scaled in such a way that a certain number of the most 
significant digits are fixed, resulting in a simplified reduction operation. How- 
ever, the cost of this method is precalculation and storage of the scaled modulus. 
Y.-J. Jeong et al. presented an architecture for iterative modulo multiplication 
that performs the quotient estimation by table lookups [8]. Their design also 
requires storage of some precalculated complements of the modulus, resulting 
in an increase in needed resources. The partial-parallel multiplier introduced by 
H. Orup et al. contains a quotient estimation circuit that estimates the 12 high- 
est order bits of the redundant partial sum, and then chooses an appropriate 
multiple of the modulus to be subtracted [20] . The most significant drawback of 
Orup’s architecture is a long critical path introduced by the quotient estimation 
circuit, which limits the clock frequency. Higher-radix methods for MSB-first 
iterative modulo multiplication have been reported in [12,18,24,25]. 



2.2 Carry-Save Adders 

The carry propagation in long integer addition is easily eliminated by the
implementation of a carry-save adder (CSA). Carry-save adders are widely used
in arithmetic circuits due to their performance in terms of speed and silicon
area [21]. An n-bit CSA consists of n full-adders (FA), and solves the carry
propagation problem by using a redundant representation for the result (i.e., the
carries are saved). This means that the result is not a single binary number, but
is represented by two n-bit numbers instead: Rs (the sum bits) and Rc (the
carry bits). The delay of a carry-save adder is constant (i.e., independent of the
length of the operands) and only determined by the delay of a single full-adder.
In many applications, the sum output Rs and the carry output Rc are latched
or registered, either for synchronization purposes or for pipelining.



[Figure: four full-adders with inputs X[i], S[i], C[i] and outputs Rs[i], Rc[i], for i = 3, ..., 0]

Fig. 2: Block diagram of a 4-bit carry-save adder.



Figure 2 illustrates a 4-bit carry-save adder. The basic principle of the
carry-save addition is to reduce the sum of three binary numbers S, C, X to the
sum of two binary numbers Rs, Rc without carry propagation according to the
following equations:

$$Rs[i] = S[i] \oplus C[i] \oplus X[i] \qquad (3)$$

$$Rc[i+1] = S[i] \cdot C[i] + S[i] \cdot X[i] + C[i] \cdot X[i] \quad \text{with } Rc[0] = C_{in} = 0 \qquad (4)$$

Note that the operators in the previous equations are logical operators and not
arithmetic operators. When using carry-save adders, the intermediate result R
is not a single binary number anymore, but is given in a redundant
representation as a sum and carry pair (Rs, Rc) instead, whereby Rs denotes the sum
part of the result, and Rc the carry part, respectively. Carry-save adders are
advantageous if many subsequent additions have to be performed.
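
Equations (3) and (4) translate directly into bitwise operations. The following Python sketch (function name assumed) applies them to whole words at once; Python integers stand in for the n-bit vectors, so the carry word simply grows by one bit instead of being truncated.

```python
def carry_save_add(s: int, c: int, x: int) -> tuple[int, int]:
    """Reduce three addends to a (sum, carry) pair without carry propagation."""
    rs = s ^ c ^ x                           # equation (3): bitwise XOR
    rc = ((s & c) | (s & x) | (c & x)) << 1  # equation (4): majority, shifted left
    return rs, rc


if __name__ == "__main__":
    rs, rc = carry_save_add(0b1011, 0b0110, 0b1101)
    print(bin(rs), bin(rc), rs + rc)  # rs + rc equals the sum of the three inputs
```

Each output bit depends only on the three input bits of one position, which is why the delay is that of a single full-adder regardless of the word length.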

3 Optimized MSB-First Iterative Modulo Multiplication 

The major hindrance of the bit-serial architectures for modulo multiplication 
described in subsection 2.1 is that they either require a costly quotient evaluation 
circuit or a circuit for performing comparisons of long integers. These circuits 
cause significant additional hardware and may limit the clock frequency due 
to a long critical path. Furthermore, some of the mentioned implementations 
need a large amount of storage for precomputed multiples of the modulus. If 
the modulus is to be dynamic, the stored modulus multiples must be updated 
whenever the modulus is changed. 











Input:  An n-bit modulus M (i.e., 2^(n−1) ≤ M < 2^n), a multiplicand A in the
        range of 0 ≤ A < 2^n, and a multiplier B in the range of 0 ≤ B < 2^n.
Output: The result R in the range of 0 ≤ R < 2^n. R is possibly not fully
        reduced, i.e., R = A·B mod M + k·M with k ∈ {0, 1}.

 1  (Rs, Rc) ← 0
 2  for i from n − 1 downto 0 do
 3      (Rs, Rc) ← 2·(Rs, Rc) + A·B[i]
 4      while (Rs, Rc) ≥ 2·2^n do (Rs, Rc) ← (Rs, Rc) − 2·M
 5      while (Rs, Rc) ≥ 2^n do (Rs, Rc) ← (Rs, Rc) − M
 6  endfor
 7  R ← Rs + Rc    { red. to non-red. conversion }
 8  if R ≥ 2^n then
 9      (Rs, Rc) ← R − M
10      R ← Rs + Rc    { red. to non-red. conversion }
11  endif



Fig. 3: Optimized version of the MSB-first iterative modulo multiplication. 
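
A value-level Python model of Fig. 3 is sketched below. It is a simplification: the redundant pair (Rs, Rc) is represented by the single integer Rs + Rc, and the comparisons against 2^n and 2·2^n are exact rather than estimated from the leading bits, so the final correction step is kept only for fidelity to the figure. The function name is an assumption for illustration.

```python
def modmul_partial_reduction(a: int, b: int, m: int) -> int:
    """Value-level model of the optimized MSB-first algorithm (Fig. 3)."""
    n = m.bit_length()                      # 2**(n-1) <= M < 2**n
    limit = 1 << n
    assert 0 <= a < limit and 0 <= b < limit
    r = 0                                   # stands for the value Rs + Rc
    for i in range(n - 1, -1, -1):
        r = 2 * r + a * ((b >> i) & 1)
        while r >= 2 * limit:               # compare to 2 * 2**n instead of 2*M
            r -= 2 * m
        while r >= limit:                   # compare to 2**n instead of M
            r -= m
    if r >= limit:                          # final correction (steps 8 to 11)
        r -= m
    return r                                # A*B mod M, possibly plus one M


if __name__ == "__main__":
    r = modmul_partial_reduction(15, 14, 9)
    print(r, r % 9, (15 * 14) % 9)
```

Note that the operands may exceed M as long as they are below 2^n, and that the result is only guaranteed to be below 2^n, so a caller needing the canonical residue may have to subtract M once more.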



The most crucial operation of the classical MSB-first algorithm for iterative
modulo multiplication is the calculation of the quotient q, which is the same as
to decide whether the current intermediate result is smaller than M (and
consequently q = 0), or bigger than M (and consequently q = 1), or bigger than 2·M
(and consequently q = 2). This decision is difficult for long integers because an
n-bit modulus M can vary between its minimum value $M_{min}$ of $2^{n-1}$ and its
maximum value $M_{max}$ of $2^n - 1$:

$$2^{n-1} \le M < 2^n \;\Rightarrow\; M_{min} = 2^{n-1} \text{ and } M_{max} = 2^n - 1 \qquad (5)$$

Additionally, a redundant representation of the intermediate result does not
make this task easier. An efficient solution for this problem is to compare the
redundant intermediate result to $2^n$ and to $2 \cdot 2^n$ instead of the exact values of
M and 2·M, since these comparisons are simpler to implement in hardware, as
will be demonstrated in subsection 3.1.

Figure 3 shows a modified version of the shift-and-add multiplication which
is optimized for hardware implementation. The intermediate result is written in
redundant representation (Rs, Rc) to indicate that the additions and subtractions
should be performed by carry-save adders. Another interesting detail of the
modified algorithm is the fact that the modulo reduction is not carried out “at
once”, but is split into continued subtractions of 2·M and/or M. The subtraction
of M and 2·M can be realized by addition of the two's complement of M, and
by addition of the 1-bit left-shifted two's complement of M, respectively. When
using a carry-save adder for the two's complement addition, the subtraction is
performed in constant time. During a modulo multiplication the intermediate
result is always in redundant representation. After the last multiplier bit B[0] has
been processed, the result must be converted from redundant into non-redundant
representation. This conversion can be performed by a pipelined carry-lookahead
adder (see section 5). If the non-redundant result R is at least $2^n$, one final
subtraction of M is necessary to bound the result within the range of $[0, 2^n)$. It
must be emphasized, however, that the operands A and B do not need to be fully
reduced, but they must be smaller than $2^n$ to ensure that the algorithm works
correctly. A remaining problem is the comparison of the redundant intermediate
result to $2 \cdot 2^n$ and $2^n$, respectively. In the next subsection we present an efficient
solution for this problem by applying a special estimation technique.

Table 1: Redundant number estimations. Rs and Rc are both (n+1)-bit numbers,
Rs[n] is the MSB of Rs, and Rc[n] is the MSB of Rc.

Rs[n]  Rc[n]  Rs[n−1]  Rc[n−1]  Rs + Rc                      Estimation
0      0      0        0        Rs+Rc < 2^(n−1) + 2^(n−1)    (Rs,Rc) < 3·2^(n−1)
0      0      0        1        Rs+Rc < 2^(n−1) + 2^n        (Rs,Rc) < 3·2^(n−1)
0      0      1        0        Rs+Rc < 2^n + 2^(n−1)        (Rs,Rc) < 3·2^(n−1)
0      0      1        1        Rs+Rc ≥ 2^(n−1) + 2^(n−1)    (Rs,Rc) ≥ 2^n
0      1      X        X        Rs+Rc ≥ 2^n                  (Rs,Rc) ≥ 2^n
1      0      X        X        Rs+Rc ≥ 2^n                  (Rs,Rc) ≥ 2^n
1      1      X        X        Rs+Rc ≥ 2^n + 2^n            (Rs,Rc) ≥ 2·2^n



3.1 Redundant Number Estimation 

The modified algorithm in figure 3 requires a comparison of the intermediate
result to $2 \cdot 2^n$ and $2^n$ to decide whether or not a subtraction of
$2 \cdot M$ or $M$ has to be performed. For hardware implementation, this is a
significant improvement over the first algorithm because it avoids the necessity
for an exact comparison between the intermediate result and the modulus.
Furthermore, the comparison to $2 \cdot 2^n$ and $2^n$ can be easily realized by a
novel estimation technique, in the following denoted as redundant number
estimation.

Table 1 shows a simplified logical truth-table to decide the two inequalities
$(Rs, Rc) \ge 2^n$ and $(Rs, Rc) \ge 2 \cdot 2^n$. For decision of the first
inequality, only the two most significant bits of $Rs$ and $Rc$ need to be
scanned, and for the second inequality only the MSB of $Rs$ and $Rc$,
respectively. Note that $Rs$ and $Rc$ are both $(n+1)$-bit numbers; consequently,
the MSB of $Rs$ is $Rs[n]$ and the MSB of $Rc$ is $Rc[n]$. The hardware to decide
the multiple of the modulus to be subtracted can be defined by the following two
logical equations, in which $+$ denotes logical OR and $\cdot$ denotes logical AND:

$$\mathit{sub1} = Rs[n] + Rc[n] + (Rs[n-1] \cdot Rc[n-1]) \tag{6}$$

$$\mathit{sub2} = Rs[n] \cdot Rc[n] \tag{7}$$

If $\mathit{sub1} = 0$, then the intermediate result $(Rs, Rc)$ is smaller than
$3 \cdot 2^{n-1}$ and consequently also smaller than $3 \cdot M$. This estimation is
correct for any value of $M$ according to equation (5), even for $M = M_{min}$. On
the other hand, if $\mathit{sub1} = 1$, the intermediate result is at least $2^n$,
and consequently it can be estimated to be also bigger than $M$. Therefore, at
least one subtraction of $M$ is necessary, even if $M = M_{max}$.

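For any $n$-bit modulus $M$ satisfying $2^{n-1} < M < 2^n$, the redundant number
estimations observed from table 1 can be summarized as follows:

$$\mathit{sub1} = 0 \text{ and } \mathit{sub2} = 0 \;\Rightarrow\; (Rs, Rc) < 3 \cdot M \tag{8}$$

$$\mathit{sub1} = 1 \text{ and } \mathit{sub2} = 0 \;\Rightarrow\; (Rs, Rc) > M \tag{9}$$

$$\mathit{sub2} = 1 \;\Rightarrow\; (Rs, Rc) > 2 \cdot M \tag{10}$$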
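The estimations (8) to (10) can be checked exhaustively for a small word size. The following is a minimal sketch in Python, assuming that $(Rs, Rc)$ stands for the integer $Rs + Rc$ and that the modulus satisfies $2^{n-1} < M < 2^n$; the function names `estimate` and `check_estimations` are illustrative and do not appear in the paper.

```python
from itertools import product

def estimate(rs, rc, n):
    """Evaluate equations (6) and (7) on two (n+1)-bit words Rs and Rc."""
    rs_n, rc_n = (rs >> n) & 1, (rc >> n) & 1                  # MSBs
    rs_n1, rc_n1 = (rs >> (n - 1)) & 1, (rc >> (n - 1)) & 1    # second-highest bits
    sub1 = rs_n | rc_n | (rs_n1 & rc_n1)   # equation (6): OR of MSBs and an AND term
    sub2 = rs_n & rc_n                     # equation (7)
    return sub1, sub2

def check_estimations(n):
    """Return True if (8), (9) and (10) hold for every (n+1)-bit Rs, Rc and every M."""
    words = range(1 << (n + 1))
    moduli = range((1 << (n - 1)) + 1, 1 << n)   # assumed range 2^(n-1) < M < 2^n
    for rs, rc in product(words, words):
        total = rs + rc
        sub1, sub2 = estimate(rs, rc, n)
        for m in moduli:
            if sub1 == 0 and sub2 == 0 and not total < 3 * m:   # (8)
                return False
            if sub1 == 1 and sub2 == 0 and not total > m:       # (9)
                return False
            if sub2 == 1 and not total > 2 * m:                 # (10)
                return False
    return True

print(check_estimations(4))
```

Only four bits of the redundant pair are inspected, which is why the decision logic stays off the critical path regardless of the operand length.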

The optimized MSB-first algorithm illustrated in figure 3 compares the
intermediate result $(Rs, Rc)$ to $2 \cdot 2^n$ and $2^n$ instead of the actual
values $2 \cdot M$ and $M$. For this reason, it is possible that the intermediate
result is not always fully reduced. But if the comparisons are performed
according to the presented redundant number estimations, the algorithm
guarantees that the intermediate result is always smaller than three times the
modulus (i.e., smaller than $3 \cdot 2^{n-1}$) before the next multiplier bit $B[i]$
is processed. This is valid for any modulus $M$ which satisfies equation (5), even
for $M = M_{min}$.

After each addition of a partial-product, the modulo reduction is accomplished
by continued subtractions of $2 \cdot M$ and $M$. Of course this raises the question
how many subtractions of $2 \cdot M$ and/or $M$ will be (at most) necessary. Because
the redundant number estimation guarantees that the intermediate result
$(Rs, Rc)$ is smaller than $3 \cdot 2^{n-1}$ before the quantity
$2 \cdot (Rs, Rc) + A \cdot B[i]$ is computed, the product $2 \cdot (Rs, Rc)$ is always
smaller than $6 \cdot 2^{n-1}$. Since the partial-product $A \cdot B[i]$ is smaller than
$2^n$, it is proven that the intermediate result is smaller than $8 \cdot 2^{n-1}$
before beginning the modulo reduction. Thus, for any modulus $M$ satisfying
equation (5), at most three subtractions of $2 \cdot M$ or $M$ are necessary until
$(Rs, Rc)$ is smaller than $3 \cdot 2^{n-1}$. On the other hand, a more precise
quotient evaluation would reduce the number of subtractions. However, the
proposed method benefits from the fact that the redundant number estimation does
not cause a significant critical path and that no multiples of $M$ need to be
precomputed and stored.

4 Arithmetic in Binary Extension Fields GF($2^m$)

The elements of GF($2^m$) are polynomials of degree less than $m$, with
coefficients in GF(2). For example, if $a(t)$ is an element in GF($2^m$), then one
can have

$$a(t) = \sum_{i=0}^{m-1} a_i t^i = a_{m-1} t^{m-1} + \ldots + a_2 t^2 + a_1 t + a_0 \quad \text{with } a_i \in \{0, 1\} \tag{11}$$

This binary polynomial can also be written in bit-string form as
$A[m-1 \ldots 0]$, whereby $A[i]$ corresponds to the coefficient $a_i$. Finite fields
of characteristic 2 are attractive for hardware implementation due to their
“carry-free” arithmetic. The addition in GF($2^m$) is implemented as
component-wise exclusive OR (XOR), whilst the implementation of the
multiplication depends on the basis chosen [14].
    Input:  An irreducible polynomial p(t) of degree m, a multiplicand-polynomial
            a(t), and a multiplier-polynomial b(t).
    Output: Result-polynomial r(t) = a(t)·b(t) mod p(t).

    r(t) ← 0
    for i from m−1 downto 0 do
        r(t) ← t·r(t) + a(t)·b_i
        if degree(r(t)) = m then r(t) ← r(t) − p(t)
    endfor

Fig. 4: MSB-first iterative multiplication in GF($2^m$).



The simplest representation is in polynomial basis, where the multiplication is
performed modulo an irreducible polynomial of degree exactly $m$.

A bit-serial polynomial basis multiplier for GF($2^m$) has an area complexity of
$O(m)$ and computes a multiplication in $m$ clock cycles. Such multipliers have
been well known since the early 1970s due to their exploration in coding theory
[22], and later they have also been proposed for use in cryptography [2]. A
recent publication reports a bit-serial architecture which is able to perform
additions and multiplications over a variety of binary fields up to an order of
$2^m$ [7].

4.1 Addition 

The addition in GF($2^m$) is performed by adding the coefficients modulo 2, which
is nothing else than bit-wise XOR-ing the coefficients of equal powers of $t$.
Compared to the addition of integers, the addition in GF($2^m$) is much easier as
it does not cause carry propagation. It is well known that in the field GF($2^m$)
any element $a(t)$ is its own additive inverse since $a(t) + a(t) = 0$, the
additive identity. Consequently, addition and subtraction are equivalent
operations in GF($2^m$).
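The bit-string view of equation (11) and the carry-free addition can be illustrated with a few lines of Python. This is a minimal sketch, assuming that a polynomial is stored in an integer whose bit $i$ holds the coefficient $a_i$; the helper names `poly_to_bits` and `gf2m_add` are illustrative only.

```python
def poly_to_bits(exponents):
    """Build the bit-string A[m-1..0] from the (distinct) exponents whose coefficient is 1."""
    value = 0
    for e in exponents:
        value |= 1 << e
    return value

def gf2m_add(a, b):
    """Addition (and subtraction) in GF(2^m): coefficient-wise XOR, no carries."""
    return a ^ b

a = poly_to_bits([3, 1, 0])    # t^3 + t + 1   -> 0b1011
b = poly_to_bits([3, 2])       # t^3 + t^2     -> 0b1100
print(bin(gf2m_add(a, b)))     # t^2 + t + 1   -> 0b111
print(gf2m_add(a, a))          # every element is its own additive inverse -> 0
```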

4.2 Multiplication 

Multiplication in GF($2^m$) involves multiplying the two polynomials together
(carry-free coefficient multiplication) and then finding the residue modulo a
given irreducible polynomial $p(t)$. In general, the reduction modulo an
irreducible polynomial $p(t)$ requires polynomial division. For an efficient
implementation it is necessary to perform the field multiplication without
polynomial division. One possibility is to interleave the reduction modulo $p(t)$
with the multiplication operation, instead of performing the reduction
separately after the multiplication of the polynomials is finished. This leads
to a characteristic 2 version of the shift-and-add method, where the
multiplication is realized by addition of partial-products, and the reduction is
performed by subtraction of the irreducible polynomial. The pseudocode
illustrated in figure 4 describes this algorithm.

The multiplication of two polynomials $a(t), b(t) \in$ GF($2^m$) modulo an
irreducible polynomial $p(t)$ is done by scanning the coefficients of the
multiplier-polynomial $b(t)$ from $b_{m-1}$ to $b_0$ and adding the
partial-product $a(t) \cdot b_i$ to the intermediate result $r(t)$. The
partial-product $a(t) \cdot b_i$ is either 0 (if $b_i = 0$) or the
multiplicand-polynomial $a(t)$ (if $b_i = 1$). After each partial-product
addition, the intermediate result must be multiplied by $t$ to align it for the
next partial-product. The reduction modulo the irreducible polynomial $p(t)$ is
interleaved with the partial-product additions by subtraction of $p(t)$ if the
degree of the intermediate result is $m$, i.e., if the coefficient $r_m$ is 1. It
turns out that the computation of $r(t) = a(t) \cdot b(t) \bmod p(t)$ requires $m$
steps, and at each step we perform the following operations:

— computation of $t \cdot r(t)$ (a 1-bit left-shift)

— generation of a partial-product (logical AND between $b_i$ and $a(t)$)

— addition of the partial-product (an $(m+1)$-bit XOR operation)

— generation of the subtrahend (logical AND between $r_m$ and $p(t)$)

— subtraction of the subtrahend (an $(m+1)$-bit XOR operation)

The required logical operations are AND, XOR, and 1-bit left-shifts, which makes 
a hardware implementation of this algorithm very straightforward. 
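The five per-step operations listed above map directly onto integer bit operations. The following Python sketch follows the pseudocode of figure 4 under the assumption that polynomials are held as integers with bit $i$ equal to the coefficient of $t^i$; the function name `gf2m_mul` and the test field GF($2^8$) with $p(t) = t^8 + t^4 + t^3 + t + 1$ are chosen for illustration and are not taken from the paper.

```python
def gf2m_mul(a, b, p, m):
    """MSB-first shift-and-add multiplication a(t)*b(t) mod p(t) in GF(2^m)."""
    r = 0
    for i in range(m - 1, -1, -1):
        r <<= 1                          # computation of t*r(t): 1-bit left-shift
        b_i = (b >> i) & 1
        partial = a if b_i else 0        # partial-product: AND between b_i and a(t)
        r ^= partial                     # addition: (m+1)-bit XOR
        r_m = (r >> m) & 1
        subtrahend = p if r_m else 0     # subtrahend: AND between r_m and p(t)
        r ^= subtrahend                  # subtraction: (m+1)-bit XOR
    return r

P = 0x11B   # t^8 + t^4 + t^3 + t + 1, used here only as a familiar example field
print(hex(gf2m_mul(0x57, 0x83, P, 8)))   # 0xc1
```

Each loop iteration corresponds to one clock cycle of a bit-serial multiplier, so a multiplication takes exactly $m$ iterations.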

5 Multiplier Architecture 

When taking a closer look at the multiplication algorithms for GF($p$) (figure 3)
and for GF($2^m$) (figure 4), it is easily observed that these algorithms have
some similarities. In both algorithms, one operand (the multiplier) is scheduled
bit by bit, beginning with the MSB, and the other operand (the multiplicand) is
scheduled fully parallel. Both algorithms perform three basic operations:
addition of partial-products, 1-bit left-shifts of the intermediate result, and
subtraction(s) of the modulus (or the irreducible polynomial, respectively). The
main difference is how the addition or subtraction is performed. An addition in
GF($p$) involves addition of integers and can be performed by carry-save adders,
using a redundant representation for the result. On the other hand, the addition
in GF($2^m$) is a simple logical XOR operation.



5.1 Implementation of the Field Arithmetic 

Figure 5 illustrates an arithmetic unit for implementation of the field additions
and subtractions, respectively. All carry-save additions have to be performed
with $(n+1)$-bit precision. The sum output $Rs$ and the carry output $Rc$ of the
adders are latched on each half-cycle for synchronization purposes. Note that the
circuit for generation of the partial-product as well as the circuit for
generation of the subtrahend are not shown in figure 5.

A subtraction is usually performed by adding the two’s complement of the
subtrahend $S$, which can be realized in our case by addition of the bitwise
complement of $S$ and setting the initial carry $C_{in}$ to 1. Therefore, addition
and subtraction are essentially the same operation. It must be emphasized that
the MSB-first algorithm from figure 3 guarantees that the intermediate result
will never become negative, i.e., the $C_{out}$ output of the carry-save adders
can be ignored. In the following we describe how this arithmetic unit can be used
to implement a modulo multiplication. Later in this section it will be shown that
the arithmetic unit can also perform the addition/subtraction in GF($2^m$).

Fig. 5: Arithmetic unit of an n-bit unified multiplier.
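The two building blocks named in this subsection, carry-save addition and subtraction by two's complement, can be sketched as follows. This is an illustrative Python model, not the circuit of figure 5: the helper names are hypothetical, and the carry-save adder is modelled on unbounded Python integers, so the truncation to $(n+1)$ bits and the ignored $C_{out}$ of the real unit are deliberately left out.

```python
def carry_save_add(x, y, z):
    """Bitwise full-adders: return (sum word, carry word) with s + c == x + y + z."""
    s = x ^ y ^ z                               # sum bit of each full-adder
    c = ((x & y) | (x & z) | (y & z)) << 1      # carry bit, weighted one position higher
    return s, c

def subtract_via_complement(x, s, width):
    """Compute (x - s) mod 2^width by adding the bitwise complement of s with C_in = 1."""
    mask = (1 << width) - 1
    return (x + (~s & mask) + 1) & mask

s, c = carry_save_add(13, 7, 9)
print(s, c, s + c)                              # the redundant pair represents 29
print(subtract_via_complement(29, 11, 6))       # 18
```

Because no carry ripples through the word, the delay of `carry_save_add` in hardware is that of a single full-adder, independent of the operand length.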

According to the MSB-first iterative algorithm for modulo multiplication,
the processing of a multiplier bit $B[i]$ takes place in the following way: The
first carry-save adder at the top of figure 5 performs the addition of the
partial-product $A \cdot B[i]$ to the current intermediate result. The sum output
$Rs$ and the carry output $Rc$ of the first CSA are used to estimate the multiple
of the modulus to be subtracted at the second CSA. This estimation is performed
as described in section 3.1, and only the two highest order bits of $Rs$ and $Rc$
are needed to implement the logical functions of equations (6) and (7).
Therefore, the hardware to decide whether to subtract $0 \cdot M$, $1 \cdot M$ or
$2 \cdot M$ can be implemented very efficiently and will not cause a long critical
path in the arithmetic unit.

The subtraction of $M$ or $2 \cdot M$ is realized by addition of the two’s
complement of $M$ or $2 \cdot M$ to the output of the first CSA, which takes place
at the second CSA. But one subtraction of $2 \cdot M$ or $M$ may not be enough to
guarantee that the intermediate result is within the range of
$[0, 3 \cdot 2^{n-1})$. Therefore, a control signal $\mathit{xsub}$ is generated
according to equation (6) in order to decide whether or not an extra subtraction
of $M$ or $2 \cdot M$ is necessary. If an extra subtraction is required, the outputs
$Rs$ and $Rc$ of the second CSA are fed back to the first (upper) CSA (without a
left-shift). For an extra subtraction, the multiplier bit $B[i]$ must be masked
off, so that no partial-product (i.e., zero) is added at the first CSA. After
that, the extra subtraction of $M$ or $2 \cdot M$ takes place again at the second
CSA.

If no extra subtraction is required, the processing of the multiplier bit is
finished. The outputs of the second CSA are fed back to the inputs of the first
CSA with a 1-bit hardwired left-shift. $Rs$ and $Rc$ are now correctly aligned for
addition of the next partial-product and the same procedure starts again. After
the last multiplier bit has been processed, the sum and carry of the second CSA
represent the redundant result $(Rs, Rc)$ of the modulo multiplication.

Generation of the partial-product. The partial-product $A \cdot B[i]$ is either 0
(if $B[i]$ is 0), or the multiplicand $A$ (if $B[i]$ is 1). Thus, the generation of
the partial-product $A \cdot B[i]$ is simply done by a bit-wise AND operation
between the multiplier bit $B[i]$ and all the bits of the multiplicand $A$.

Generation of the subtrahend. The subtrahend $S = j \cdot M$, $j \in \{0, -1, -2\}$
must be generated according to the requirements of the optimized MSB-first
algorithm. In the presented arithmetic unit the subtraction of $S$ is realized by
addition of the bitwise complement of $S$ and by setting the initial carry
$C_{in}$ of the CSA to 1. The control signals $\mathit{sub1}$ and $\mathit{sub2}$
introduced in section 3.1 indicate whether the subtrahend $S$ has to be 0, $M$, or
$2 \cdot M$, and they can be used for generating the subtrahend-bits $S[i]$
according to the following equations:¹

$$S[i] = \mathit{sub1} \cdot \overline{\mathit{sub2}} \cdot M[i] + \mathit{sub2} \cdot M[i-1] \quad \text{for } i = 1 \ldots n \tag{12}$$

$$S[0] = \mathit{sub1} \cdot \overline{\mathit{sub2}} \cdot M[0] + \mathit{sub2} \cdot 0 \tag{13}$$
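The selection of the subtrahend can be sketched bit by bit in Python. This example rests on an assumption about equations (12) and (13), whose printed form is damaged in this copy: the first term is read with the complement of $\mathit{sub2}$, and the second term as $\mathit{sub2} \cdot M[i-1]$ (with bit 0 of $2 \cdot M$ equal to 0), which matches the simplified equation given in the footnote. The function name `subtrahend` is illustrative.

```python
def subtrahend(sub1, sub2, m, n):
    """Generate the (n+1)-bit subtrahend S from sub1, sub2 and the n-bit modulus M."""
    s = 0
    for i in range(n + 1):
        m_i = (m >> i) & 1                               # M[i] (M[n] is 0 for an n-bit M)
        m_prev = (m >> (i - 1)) & 1 if i > 0 else 0      # M[i-1]; bit 0 of 2*M is 0
        # Assumed reading of (12)/(13): select M when only sub1 is set, 2*M when sub2 is set.
        bit = (sub1 & (1 - sub2) & m_i) | (sub2 & m_prev)
        s |= bit << i
    return s

M, n = 13, 4
print(subtrahend(0, 0, M, n), subtrahend(1, 0, M, n), subtrahend(1, 1, M, n))   # 0 13 26
```

No multiple of $M$ has to be stored: $2 \cdot M$ is obtained purely by routing bit $M[i-1]$ to position $i$.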

Performing addition/subtraction in GF($2^m$). The sum bit $Rs[i]$ of a full-adder
calculates the logical XOR of its three inputs (see equation (3)). By setting all
carry bits of the adders to 0, the sum outputs $Rs[i]$ of the adders provide the
functionality of a 2-input XOR gate. This is exactly the functionality required
for addition/subtraction in GF($2^m$). Also the partial-products are generated in
exactly the same way as described before, namely by a logical AND of the
coefficient $B[i]$ and all the coefficients² of the multiplicand polynomial
$a(t)$. A reduction of the intermediate result is necessary whenever the degree
of the result-polynomial is $m$, i.e., if $Rs[m]$ is 1. The requirement for a
subtraction of the irreducible polynomial $p(t)$ is indicated by the control
signal $\mathit{sub1}$, since $\mathit{sub1} = Rs[m]$ if the carry bits $Rc[i]$ are
set to 0:

$$\mathit{sub1} = Rs[m] + 0 + (Rs[m-1] \cdot 0) = Rs[m] \quad \text{and} \quad \mathit{sub2} = Rs[m] \cdot 0 = 0$$

The control signal $\mathit{sub2}$ is always 0. As mentioned in subsection 4.2, the
generation of the subtrahend $S$ is a logical AND between the control signal
$\mathit{sub1}$ and the bits of the irreducible polynomial, i.e.,
$S[i] = \mathit{sub1} \cdot P[i] = Rs[m] \cdot P[i]$. The presented arithmetic unit
provides exactly the functionality required for the multiplication in the binary
extension field GF($2^m$) when the carry bits $Rc[i]$ are set to 0.
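The effect of forcing the carry word to zero can be checked with a small Python model. This is a sketch under the assumption that the control logic of equations (6) and (7) is reused unchanged with $n$ replaced by $m$; the names `full_adder_sum`, `control_signals` and `reduce_step` are hypothetical.

```python
def full_adder_sum(x, y, carry_in):
    """Sum output of a full-adder: XOR of its three inputs."""
    return x ^ y ^ carry_in

def control_signals(rs, rc, m):
    """Equations (6) and (7) evaluated on the two top bits of Rs and Rc."""
    rs_m, rc_m = (rs >> m) & 1, (rc >> m) & 1
    rs_m1, rc_m1 = (rs >> (m - 1)) & 1, (rc >> (m - 1)) & 1
    sub1 = rs_m | rc_m | (rs_m1 & rc_m1)
    sub2 = rs_m & rc_m
    return sub1, sub2

def reduce_step(rs, p, m):
    """GF(2^m) mode: carries are zero, so S[i] = sub1 AND P[i] and subtraction is XOR."""
    sub1, _ = control_signals(rs, 0, m)
    s = p if sub1 else 0
    return rs ^ s

p = 0b10011                              # t^4 + t + 1, an illustrative irreducible polynomial
print(control_signals(0b11010, 0, 4))    # (1, 0): degree 4, reduction required
print(bin(reduce_step(0b11010, p, 4)))   # 0b1001
```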

¹ The algorithm also works with the following control signals:
$\mathit{sub1} = Rs[n] \oplus Rc[n]$, $\mathit{sub2} = Rs[n] \cdot Rc[n]$, and
$\mathit{xsub} = Rs[n] + Rc[n] + (Rs[n-1] \cdot Rc[n-1])$. In this case the
generation of the subtrahend bits $S[i]$ is simplified to the following equation:
$S[i] = \mathit{sub1} \cdot M[i] + \mathit{sub2} \cdot M[i-1]$.

² According to the bit-string notation introduced in section 4.1, the coefficient
$x_i$ of a polynomial $x(t)$ is denoted as $X[i]$.
Fig. 6: Block diagram of the bit-serial multiplier architecture.



5.2 The Unified Multiplier Architecture 

Figure 6 shows the bit-serial multiplier architecture, consisting of the
$(n+1)$-bit arithmetic unit, four $n$-bit registers, and a pipelined $w$-bit
carry-lookahead adder [21], whereby $w$ denotes the wordsize of the registers
(usually 8, 16, or 32 bits). The I/O Register performs data transfers from and to
the world outside the multiplier. We prefer to provide a separate register for
I/O operations to ensure that the overall performance of the multiplier is not
reduced by slow data transfers. The Modulus/IP Register is needed to store the
bit-string representation of the modulus or the irreducible polynomial,
respectively. Multiplicand and Multiplier Register are used for storing the
current operands of the multiplication. Both registers can carry out $w$-bit
shift operations in LSB direction; register Multiplier can additionally perform
1-bit shift operations in MSB direction (1-bit left-shifts). All four registers
are connected through an $n$-bit bus.

After the operands have been loaded into the corresponding registers, a modulo
multiplication takes place in the following way: The Multiplier register is
shifted bit by bit in MSB direction to deliver the multiplier bits $B[n-1]$ to
$B[0]$ to the arithmetic unit. The processing of the multiplier bits $B[i]$ is
performed as described in subsection 5.1. The control signal $\mathit{xsub}$ is
generated from the two most significant bits of $Rs$ and $Rc$ of the second CSA
according to equation (6). Whenever $\mathit{xsub}$ is 1, the arithmetic unit has
to perform an extra subtraction and register Multiplier must stop the left-shift
until $\mathit{xsub} = 0$. After the least significant multiplier bit $B[0]$ has
been processed, the redundant result $(Rs, Rc)$ is loaded into registers
Multiplier and Multiplicand, respectively. Note that the old values of the
multiplier and multiplicand are not needed any more. Now the redundant result
must be converted into non-redundant representation. This is done by the
pipelined $w$-bit carry-lookahead adder (CLA) and requires $\lceil n/w \rceil$
clock cycles plus the delay of the CLA (usually $\log_2(w)$ clock cycles). The
output of the CLA is fed back to the Multiplier register. Since the non-redundant
result may not be in the range of $[0, 2^n)$, an additional modulus subtraction
and redundant to non-redundant conversion may be necessary. After the modulo
multiplication has finished, the result resides within register Multiplier.

Table 2: Typical subtraction sequences of $2 \cdot M$ and/or $M$ depending on the
range of $(Rs, Rc)$ after addition of the partial-product.

| Range of $(Rs, Rc)$ | Typical subtraction sequence | Clock cycles |
|---|---|---|
| $0 \le (Rs, Rc) < 2^{n-1}$ | — | 1 |
| $2^{n-1} \le (Rs, Rc) < 2 \cdot 2^{n-1}$ | — | 1 |
| $2 \cdot 2^{n-1} \le (Rs, Rc) < 3 \cdot 2^{n-1}$ | — | 1 |
| $3 \cdot 2^{n-1} \le (Rs, Rc) < 4 \cdot 2^{n-1}$ | $M$ | 1 |
| $4 \cdot 2^{n-1} \le (Rs, Rc) < 5 \cdot 2^{n-1}$ | $2 \cdot M$ | 1 |
| $5 \cdot 2^{n-1} \le (Rs, Rc) < 6 \cdot 2^{n-1}$ | $2 \cdot M$, $M$ | 2 |
| $6 \cdot 2^{n-1} \le (Rs, Rc) < 7 \cdot 2^{n-1}$ | $2 \cdot M$, $2 \cdot M$ | 2 |
| $7 \cdot 2^{n-1} \le (Rs, Rc) < 8 \cdot 2^{n-1}$ | $2 \cdot M$, $2 \cdot M$, $M$ | 3 |

In GF($2^m$)-mode, a multiplication is performed in a similar way, except that
no extra subtractions and no redundant to non-redundant conversions of the
result are necessary.

5.3 Performance Estimation 

Since the carry-save adders are separated by latches, the addition of a
partial-product and the first subtraction of $M$ or $2 \cdot M$ are performed in
one clock cycle. Any extra modulus subtraction requires an additional clock
cycle. As stated in subsection 3.1, at most three subtractions of $2 \cdot M$
and/or $M$ are necessary to guarantee that the intermediate result is smaller
than $3 \cdot M$ (i.e., $3 \cdot 2^{n-1}$). Therefore, the processing of a single
multiplier bit takes at most three clock cycles. The actual number of cycles
depends on the size of the intermediate result after addition of the
partial-product. Table 2 shows typical subtraction sequences depending on the
range of the intermediate result $(Rs, Rc)$. The values in the third column
represent the number of clock cycles required for partial-product addition and
the subtractions. For example, if
$6 \cdot 2^{n-1} \le (Rs, Rc) < 7 \cdot 2^{n-1}$ then typically two subtractions have
to be performed. Consequently, two clock cycles are necessary for the processing
of that multiplier bit.

According to the subtraction sequences shown in table 2, one can assume that
any bit of the multiplier takes on average 1.5 clock cycles to be processed. When
given arbitrary $n$-bit operands, the computation of $A \cdot B \bmod M$ requires
approximately $1.5 \cdot n$ clock cycles. Moreover, one or two redundant to
non-redundant conversions of the result are necessary, each of which needs
$\lceil n/w \rceil + \log_2(w)$ clock cycles. For an $(n+1)$-bit arithmetic unit
and a $w$-bit CLA, the number of clock cycles for a modulo multiplication can be
estimated as follows:

$$c \approx 1.5 \cdot n + 1.5 \cdot \left( \left\lceil \frac{n}{w} \right\rceil + \log_2(w) \right) \approx 1.5 \cdot n \tag{14}$$

Table 3: Principal operation characteristics of the unified multiplier.

| Operation | Integer add. | Integer mult. | GF($p$) add. | GF($p$) mult. | GF($2^m$) add. | GF($2^m$) mult. |
|---|---|---|---|---|---|---|
| Cycles per bit | - | 1 | - | ≈ 1.5 | - | 1 |
| Max. op. length | $n-1$ | $n/2$ | $n$ | $n$ | $n$ | $n$ |
| Operand align | right | right | left | left | left | left |



This means that a multiplication in the prime field GF($p$) requires
approximately $1.5 \cdot \lceil \log_2(p) \rceil$ clock cycles. On the other hand, a
multiplication in GF($2^m$) is finished after $m$ cycles since any bit of the
multiplier takes exactly one clock cycle to be processed.
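The cycle estimate can be reproduced numerically. The sketch below makes two assumptions that go beyond the text: the eight ranges of table 2 are weighted equally when averaging, and equation (14) is read as $1.5 \cdot n + 1.5 \cdot (\lceil n/w \rceil + \log_2(w))$. The operand length $n = 192$ and word size $w = 32$ are example values only, and the function names are hypothetical.

```python
from math import ceil, log2

# Clock cycles per multiplier bit for the eight ranges of table 2.
TABLE2_CYCLES = [1, 1, 1, 1, 1, 2, 2, 3]

def average_cycles_per_bit():
    """Mean of the third column of table 2, assuming equally likely ranges."""
    return sum(TABLE2_CYCLES) / len(TABLE2_CYCLES)

def conversion_cycles(n, w):
    """One redundant to non-redundant conversion: ceil(n/w) + log2(w) cycles."""
    return ceil(n / w) + log2(w)

def estimated_cycles(n, w):
    """Assumed reading of equation (14) for a GF(p) multiplication."""
    return 1.5 * n + 1.5 * conversion_cycles(n, w)

print(average_cycles_per_bit())      # 1.5
print(conversion_cycles(192, 32))    # 11.0
print(estimated_cycles(192, 32))     # 304.5, dominated by the 1.5*n = 288 term
```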



6 Summary of Results and Conclusions 

The subject of this paper was to present a novel bit-serial multiplier
architecture which operates over finite fields GF($p$) and GF($2^m$). A
multiplication in GF($p$) is performed in a serial/parallel manner, which means
that the multiplier is scheduled sequentially (bit by bit) and the multiplicand
is scheduled fully parallel. The modulo reduction is interleaved with the
multiplication by subtractions of once or twice the modulus. Thus, the arithmetic
unit has to perform only three simple operations: addition of partial-products,
left-shift of the intermediate result, and subtraction of once or twice the
modulus. Compared to other bit-serial multipliers, the proposed architecture
profits from an efficient subtrahend estimation circuit which does not cause a
significant critical path. The modulo multiplier described in this paper is also
capable of performing multiplications in GF($2^m$), i.e., it is a unified
(dual-field) multiplier for GF($p$) and GF($2^m$). Contrary to architectures which
use Montgomery multiplication, the introduced MSB-first algorithm requires
neither operand transformation into Montgomery domain nor precomputed constants.

The presented design is scalable in size, and an $n$-bit multiplier operates over
a wide range of finite fields. For example, a multiplier dimensioned for 200 bits
can also be used for fields of smaller order, like 192 or 163 bits, by
left-aligning all operands in the registers. Furthermore, the multiplier can also
perform ordinary integer addition and multiplication, respectively. The operand
size for ordinary integer multiplication is limited to about $n/2$ bits since the
product can’t exceed $n$-bit precision. Table 3 summarizes principal
characteristics of addition and multiplication over integers, prime fields
GF($p$), and binary extension fields GF($2^m$).

The unified multiplier can be implemented for an area-cost only slightly 
higher than that of the multiplier for the prime field GF(p), providing significant 





area savings when both types of multiplier are needed. To be more specific, the 
overhead introduced by the dual-field arithmetic is just a logic circuit for setting 
the carry bits of the CSA to 0, which means that this feature comes almost 
for free. Additionally, the architecture is neither restricted to use primes of a 
special form (e.g., generalized Mersenne primes), nor does it favor particular 
irreducible polynomials like trinomials or pentanomials. Another advantage of 
the bit-serial architecture is its high degree of regularity. The presented unified 
multiplier offers a fair area/performance trade-off, which makes it attractive for 
the implementation of a crypto-coprocessor for low-end 8-bit smart cards. 

The correctness of the presented concepts was verified by a functional,
cycle-based model of the multiplier architecture written in a hardware
description language. Our future work will be a VLSI implementation of the
multiplier.
Abstract. Attacks are presented on the IBM 4758 CCA and the Visa
Security Module. Two new attack principles are demonstrated. Related 
key attacks use known or chosen differences between two cryptographic 
keys. Data protected with one key can then be abused by manipulation 
using the other key. Meet in the middle attacks work by generating a 
large number of unknown keys of the same type, thus reducing the key 
space that must be searched to discover the value of one of the keys in 
the type. Design heuristics are presented to avoid these attacks and other 
common errors. 



1 Introduction 

A cryptoprocessor is a tamper-resistant processor designed to manage crypto- 
graphic keys and data in high-risk situations. The concept of a cryptoprocessor 
arose because conventional operating systems are too bug-ridden and computers 
too physically insecure to be trusted with information of high value. A nor- 
mal microprocessor is enclosed within a tamper-resistant environment, so that 
sensitive information can only be altered or released through a tightly defined 
software interface - a transaction set. In combination with access control, the 
transaction set should prevent abuse of the sensitive information. However, as 
the functionality and flexibility of transaction sets have been pushed up by man- 
ufacturers and clients, this extra complexity has made bugs in transaction sets 
inevitable. 

Sections 2 and 3 of this paper give an overview of cryptoprocessors in the 
context of four important architectural principles, and then describe the new 
vulnerabilities in a generalised way. Sections 4 and 5 introduce attacks on two 
widely fielded cryptoprocessors - the IBM 4758, and the Visa Security Module. 
Finally, some straightforward design heuristics are suggested that, whilst not 
guaranteeing the security of a transaction set, will at least stop the same mistakes 
being made over again. 

2 Tour of a Cryptoprocessor 

A cryptoprocessor’s interface to the world is its transaction set - a group of 
commands supported by the processor to manipulate and manage sensitive in- 
formation, usually cryptographic keys. Users are limited to the subset of the 



Ç.K. Koç, D. Naccache, and C. Paar (Eds.): CHES 2001, LNCS 2162, pp. 220-234, 2001.
© Springer-Verlag Berlin Heidelberg 2001




transaction set which reflects their needs using an access control system. The 
intended inputs and outputs of commands in a transaction set are described in 
terms of a type system, which describes the content of each type, and then as- 
signs a type to each input and output of the commands. Keys tend to be stored 
in a hierarchical structure so that large amounts of information can be shared 
by securely sharing only a single piece of information at the base of a branch in 
the hierarchy. 

2.1 Transaction Set Fundamentals 

— User commands are the bulk of the cryptoprocessor’s workload. The commands
allow data to be processed (e.g. encrypted, decrypted, MACs generated/verified)
using keys whose values are retained within the tamper-proof
environment, remaining unknown to the user. The user is thus restricted to 
performing actions with these keys online, where procedural controls can be 
enforced. Application-specific commands may also exist, which manipulate 
encrypted inputs and return an encrypted output or maybe a simple return 
code (e.g. a yes/no answer to whether an entered PIN matched the correct 
value for an account number, without revealing either value). 

— Key Management commands give users the ability to rearrange the key struc- 
ture. Import and export commands will allow extraction of keys from the 
structure for sharing with other processors or environments, and commands 
to build up keys from multiple parts may be available to support dual control 
policies. 

— Administration commands are highly dependent on implementation details, 
but would generally include commands for management of particularly sen- 
sitive high-level keys, modification of the access rights for other users, and 
output of clear PIN numbers in financial systems. 

2.2 Access Control 

Access control is necessary to ensure that only authorised users have access
to powerful transactions which could be used to extract sensitive information.
It can also be used to enforce procedural controls such as dual control, or
m-of-n sharing schemes, to prevent abuse of the more powerful transactions.

The simplest access control systems grant special authority to whoever has 
first use of the processor and then go into the default mode which affords no 
special privileges. An authorised person or group will load the sensitive infor- 
mation into the processor at power-up; afterwards the transaction set does not 
permit extraction of this information, only manipulation of other data using it. 
The next step up in access control is including a special authorised mode which 
can be enabled at any time with one or more passwords, physical key switches, 
or smartcards. 

More versatile access control systems will maintain a record of which transac- 
tions each user can access, or a role-based approach to permit easier restructuring 
as the job of an individual real-world user changes, either in the long term or 




through the course of a working day. In circumstances where there are multiple 
levels of authorisation, the existence of a ‘trusted path’ to users issuing spe- 
cial commands becomes important. Without using a secured session or physical 
access port separation, it would be easy for an unauthorised person to insert 
commands of their own into this session to extract sensitive information under 
the very nose of the authorised user. 

2.3 Key Hierarchies 

Storage of large numbers of keys becomes necessary when enforcing protection 
between multiple users, and serves to limit damage if one is compromised. The 
common storage method is a hierarchical structure, giving the fundamental ad- 
vantage of efficient key sharing: access can be granted to an entire key set by 
granting access to the key at the next level up the hierarchy, under which the 
set is stored. 

Confusion arises when the hierarchy serves more than one distinct role. Alter- 
nate roles include inferring the type of a key from its position in the hierarchy, 
or increasing the storage capacity of the cryptoprocessor by keeping only the 
top-level keys within the tamper-proofed environment, and storing the remain- 
der externally, with each lower level encrypted using the appropriate key from 
the level above. 

Figure 1 shows a common model with three layers of keys: 




Fig. 1. An example key hierarchy 
The top layer contains ‘master keys’ which are never revealed outside the
cryptoprocessor, the middle layer ‘transport keys’ or ‘key-encrypting-keys’ (KEKs)
to allow sharing between processors, and the bottom layer working keys and
session keys - together known as ‘operational keys’. The scope of some
cryptoprocessors extends to an even lower layer, containing data encrypted with
the operational keys.

2.4 Key Typing Systems 

Assigning type information to keys is necessary for fine-grained access control
to the transaction set. This is because many transactions have the same core
functionality, and without key typing an attacker could achieve the equivalent
of executing a transaction he does not have permission for by using an
equivalent permitted transaction (e.g. calculating a MAC can be equivalent to CBC
encryption, with all but the last block discarded). A well-designed type system
can prevent the abuse of the similarities between transactions.

An important example is the type distinction between communications data 
keys and PIN processing keys in financial systems. Customer PIN numbers are 
calculated by encrypting the account number with a PIN derivation key, thus 
commands using these keys are carefully controlled. However, if PIN keys and 
data keys were indistinguishable in type, any user with access to data manipula- 
tion transactions could calculate the PIN numbers from accounts: both employ 
the same DES or triple-DES (3DES) encryption algorithm to achieve their pur- 
pose. 

IBM’s financial products use the Common Cryptographic Architecture (CCA) 
- a standardised transaction set. The CCA name for the type information of a 
key is a control vector. Control vectors are bound to encrypted keys by XORing 
the control vector with the key used to encrypt, and including an unprotected 
copy for reference (1). The control vector is simply a bit pattern chosen to 
denote a particular type. If a naive attacker changes the clear copy of the control 
vector (i.e. the claimed key type), when the key is used, the cryptoprocessor’s 
decryption operation should simply produce garbage (2). The implementation 
details are in ‘Key Handling with Control Vectors’ [2], and ‘A Key Management 
Scheme Based on Control Vectors’ [3]. 

(1) $E_{Km \oplus CV}(KEY),\ CV$ 

(2) $D_{Km \oplus CV_{MOD}}(E_{Km \oplus CV}(KEY)) \neq KEY$ 
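
The binding in (1) and the failure mode in (2) can be illustrated with a minimal Python sketch. It is not the CCA implementation: `toy_encrypt` and `toy_decrypt` are a hypothetical 64-bit Feistel construction standing in for DES, and the master key and control vector values are invented for illustration.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def _f(key: bytes, half: bytes, rnd: int) -> bytes:
    # Round function of the toy cipher (assumption: a stand-in, NOT DES).
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:4]

def toy_encrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in range(4):
        left, right = right, xor(left, _f(key, right, rnd))
    return left + right

def toy_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in reversed(range(4)):
        left, right = xor(right, _f(key, left, rnd)), left
    return left + right

def wrap(km: bytes, cv: bytes, key: bytes):
    """Equation (1): the key encrypted under Km xor CV, plus a clear copy of CV."""
    return toy_encrypt(xor(km, cv), key), cv

def unwrap(km: bytes, claimed_cv: bytes, token: bytes) -> bytes:
    """Decrypt under Km xor the *claimed* control vector."""
    return toy_decrypt(xor(km, claimed_cv), token)

KM = bytes.fromhex("0f1e2d3c4b5a6978")       # hypothetical master key
CV_PIN = bytes.fromhex("0000000000002200")   # hypothetical 'PIN key' type
CV_DATA = bytes.fromhex("0000000000000100")  # hypothetical 'data key' type
KEY = bytes.fromhex("1122334455667788")

token, cv = wrap(KM, CV_PIN, KEY)
honest = unwrap(KM, cv, token)         # correct type: the key is recovered
tampered = unwrap(KM, CV_DATA, token)  # modified clear CV: garbage, as in (2)
print(honest == KEY, tampered == KEY)
```

The naive attack fails here only because the attacker cannot alter the key under which the token was encrypted; the related key attacks of section 3.2 remove exactly that obstacle.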

3 The Attacker’s Toolkit 

The attacks in sections 4 and 5 are presented as combinations of attack ‘building 
blocks’. This section describes new building blocks, some intuitively dangerous in 
their own right, and others which only reap maximum damage in combination. 
The full set includes reapplications of existing techniques from other fields, and 
is augmented by the usual tools and methods available to an attacker (e.g. brute 
force search, cryptanalysis). 

3.1 The Meet in the Middle Attack 

Users can normally select which key is used to protect the output of a command, 
provided it is of the correct type. The flexibility gained from specification using 
the type system is at the price of risking catastrophic failure if the value of 
even just one key within a type is discovered - select the cracked key, and the 
command output will be decipherable. The meet in the middle attack is just 
common sense statistics: if you only need to crack a single key within a type to 
be successful, the more keys that you attack in parallel, the shorter the average 
time it takes to discover one of them using a brute force search. 

The attacker first generates a large number of keys. $2^{16}$ (65,536) is a sensible 
target: somewhere between a minute and an hour’s work for the cryptoprocessors 
examined. The same test vector must then be encrypted under each key, and 
the results recorded. Each encryption in the brute force search is then compared 
against all versions of the encrypted test pattern. Checking each key will now 
take slightly longer, but there will be many fewer to check. The observation at the 
heart of the attack is that it is much more efficient to perform a single encryption 
and compare the result against many different possibilities than it is to perform 
an encryption for each comparison. 
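
The procedure can be sketched at toy scale in Python. Everything here is an illustrative assumption: the key space is shrunk to 20 bits so the search finishes quickly, only $2^8$ target keys are generated, and `toy_encrypt` is a hash-based stand-in for DES encryption of the test vector.

```python
import hashlib
import random

KEY_BITS = 20             # toy key space (single DES would be 56 bits)
NUM_TARGETS = 2 ** 8      # keys generated in the device (the text suggests 2^16)
TEST_VECTOR = b"\x00" * 8

def toy_encrypt(key: int, block: bytes) -> bytes:
    # Stand-in for DES: only a deterministic key -> ciphertext map is needed.
    return hashlib.sha256(key.to_bytes(4, "big") + block).digest()[:8]

def parallel_search(encrypted_vectors):
    """One trial encryption per candidate key, compared against ALL targets."""
    targets = set(encrypted_vectors)   # constant-time membership test
    for trials, candidate in enumerate(range(2 ** KEY_BITS), start=1):
        if toy_encrypt(candidate, TEST_VECTOR) in targets:
            return candidate, trials
    return None, 2 ** KEY_BITS

rng = random.Random(2001)
secret_keys = rng.sample(range(2 ** KEY_BITS), NUM_TARGETS)       # unknown to attacker
recorded = [toy_encrypt(k, TEST_VECTOR) for k in secret_keys]     # obtained via the device

found, trials = parallel_search(recorded)
print(found in secret_keys, trials)
```

The set lookup is what makes each comparison against many recorded ciphertexts cost roughly the same as a comparison against one.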

The power of the attack is limited by the time the attacker can spend 
generating keys. It is reasonable to suppose that up to 20 bits of key space could 
be eliminated with this method. Single DES fails catastrophically, its 56 bit key 
space reduced to 40 bits or less. A $2^{40}$ search takes a few days on a home PC. 
Attacks on a 64 bit key space could be brought within range of funded organi- 
sations. The attack has been named a ‘meet in the middle’ attack because the 
brute force search machine and the cryptoprocessor attack the key space from 
opposite sides, and the effort expended by each meets somewhere in the middle. 
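
As a rough worked estimate (an added illustration, assuming the $n$ target keys are distinct and uniformly distributed over a key space of size $2^k$, and that the search enumerates candidate keys without repetition), the expected number of trial encryptions before the first hit is

$$E[T] = \frac{2^k + 1}{n + 1} \approx \frac{2^k}{n}.$$

With $k = 56$ and $n = 2^{16}$ this gives about $2^{40}$ trials, consistent with the reduction of single DES to a 40 bit key space; generating $n = 2^{20}$ keys, the upper limit supposed above, would give about $2^{36}$.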

3.2 Related Key Attacks 

Allowing related keys to exist within a cryptoprocessor is dangerous, because 
it causes dependency between keys. Two keys can be considered related if the 
bitwise difference between them is known. Once the key set contains related keys, 
the security of one key is dependent upon the security of all keys related to it. It 
is impossible to audit for related keys without knowledge of what relationships 
might exist - and this would only be known by the attacker. Thus, the deliberate 
release of one key might inadvertently compromise another. Partial relationships 
between keys complicate the situation further. Suppose two keys become known 
to share certain bits in common. Compromise of one key could make a brute force 
attack feasible against the other. Related keys also endanger each other through 
increased susceptibility of the related group to a brute force search (see 3.1). 

Keys with a chosen relationship can be even more dangerous because some 
architectures combine type information directly into the key bits. Ambiguity 
is inevitable: the combination of one key and one type might result in exactly 
the same final key as the combination of another key and type. Allowing a 
chosen difference between keys can lead to opportunities to subvert the type 
information, which is crucial to the security of the transaction set. 

Although in most cryptoprocessors it is difficult to enter completely chosen 
keys (this usually leads straight to a severe security failure), obtaining a set of 
unknown keys with a chosen difference can be quite easy. Valuable keys (usually 
KEKs in the hierarchy diagram) are often transferred in multiple parts, combined 
using XOR to form the final key. At generation, the key parts would be given 
to separate couriers and data entry staff, so that a dual control policy could be 
implemented. Only collusion would reveal the value of the key. However, any key 
part holder could modify his part at will, so it is easy to choose a relationship 
between the actual value loaded, and the intended key value. The entry process 
could be repeated twice to obtain a pair of related keys. Some architectures allow 
a chosen value to be XORed with any key at any time. 
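
A short Python sketch of why XOR-combined key parts yield related keys. The part values and the chosen difference `delta` are hypothetical; the point is only that a single part holder controls the difference between the intended key and the loaded key without learning either.

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def combine(parts) -> bytes:
    """Form the final key as the XOR of all key parts."""
    key = bytes(len(parts[0]))
    for part in parts:
        key = xor(key, part)
    return key

part_a = secrets.token_bytes(8)   # held by the honest custodian
part_b = secrets.token_bytes(8)   # held by the attacker
delta = bytes.fromhex("00000000000000ff")   # difference chosen by the attacker

intended = combine([part_a, part_b])             # first, honest entry
related = combine([part_a, xor(part_b, delta)])  # repeated entry, tampered part

# The attacker knows neither key, but knows their difference exactly.
print(xor(intended, related) == delta)
```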

3.3 Unauthorised Type-Casting 

The commonality between transactions makes the integrity of the type system 
almost as important as the access controls over the transactions themselves. 
Once the type constraints of the transaction set are broken, abuse is easy (e.g. if 
some high security KEK could be retyped as a data key, keys protected with it 
could be exported in the clear using a standard data decipherment transaction). 

Certain type casts are only ‘unauthorised’ in so far as the designers never 
intended them to be possible. In some architectures it may even be difficult to 
tell whether or not an opportunity to type cast is a bug or a feature! Indeed, IBM 
describes a method in the manual for their 4758 CCA [1] to convert between key 
types during import to allow interoperability with earlier products which used 
a more primitive type system. The manual does not mention how easily this 
feature could be abused. If type casting is possible, it should also be possible to 
regulate it at all stages with the access control functions. 

Cryptoprocessors which do not maintain internal state about their key struc- 
ture have difficulties deleting keys. Once an encrypted version of a key has left 
the cryptoprocessor it cannot prevent an attacker storing his own copy for later 
re-introduction to the system. Thus, whenever this key undergoes an authorised 
type cast, it remains a member of the old type as well as adopting the new type. 
A key with membership of multiple types thus allows transplanting of parts of 
the old hierarchy between old and new types. Deletion can only be effected by 
changing the master keys at the top of the hierarchy, which is radical and costly. 

3.4 Poor Key-Half Binding 

Cryptographic keys get split into distinct parts when the block length of the 
algorithm protecting them is shorter than the key length. 3DES is particularly 
common, and has a 112 bit key made up from two 56 bit single DES keys. 
When the association between the halves of keys is not kept, the security of the 
key is crippled. A number of cryptoprocessors allow the attacker to manipulate 
the actual keys simply by manipulating their encrypted versions in the desired 
manner. Known or chosen key halves could be substituted into unknown keys, 
immediately halving the keyspace. The same unknown half could be substituted 
into many different keys, creating a related key set, the dangers of which are 
described in section 3.2. 

3DES has an interesting deliberate feature that makes absence of key-half 
binding even more dangerous. A 3DES encryption consists of a DES encryption 
using one key, a decryption using a second key, and another encryption with 
the first key. If both halves of the key are the same, the key behaves as a single 
length key. ($E_{K1}(D_{K2}(E_{K1}(data))) = E_K(data)$ when $K = K1 = K2$). Pure 
manipulation of unknown key halves can yield a 3DES key which operates exactly 
as a single DES key. Some 3DES keys are thus within range of a brute force 
cracking effort. 
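
The collapse of encrypt-decrypt-encrypt to a single encryption holds for any invertible cipher, which the following Python sketch checks. The cipher is a hypothetical toy Feistel construction, not DES, and the key values are arbitrary.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _f(key: bytes, half: bytes, rnd: int) -> bytes:
    # Round function of the toy cipher (assumption: a stand-in, NOT DES).
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:4]

def toy_encrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in range(4):
        left, right = right, xor(left, _f(key, right, rnd))
    return left + right

def toy_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in reversed(range(4)):
        left, right = xor(right, _f(key, left, rnd)), left
    return left + right

def ede(k1: bytes, k2: bytes, block: bytes) -> bytes:
    """Two-key triple encryption: E_K1(D_K2(E_K1(block)))."""
    return toy_encrypt(k1, toy_decrypt(k2, toy_encrypt(k1, block)))

K1 = bytes.fromhex("0123456789abcdef")
K2 = bytes.fromhex("fedcba9876543210")
DATA = b"8 bytes!"

replicate = ede(K1, K1, DATA)   # both halves equal: inner D undoes inner E
genuine = ede(K1, K2, DATA)     # differing halves: a true double-length key
print(replicate == toy_encrypt(K1, DATA), genuine == toy_encrypt(K1, DATA))
```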

3.5 Conjuring Keys from Nowhere 

Cryptoprocessor designs which store encrypted keys outside the tamper-proof 
environment can be vulnerable to unauthorised key generation. For DES keys, 
the principle is simple: choose a random value and submit it as an 
encrypted key. The decrypted result will also be random, with a 1 in $2^8$ chance 
of having the correct parity. Some early cryptoprocessors used this technique to 
generate keys (keys with bad parity were automatically corrected). Most now 
check parity but rarely enforce it, merely raising a warning. In the worst case, 
the attacker need only make trial encryptions with the keys, and observe whether 
key parity errors are raised. The odds of 1 in $2^{16}$ for 3DES keys are still quite 
feasible, and it is even easier if each half can be tested individually (see 3.4). 
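
The parity odds can be checked with a few lines of Python. This is a sketch under the assumption that each key byte carries one odd-parity bit (as in DES) and that decrypting an arbitrary submitted value yields a uniformly random result, which is modelled here by drawing random bytes directly.

```python
import random

def has_odd_parity(key: bytes) -> bool:
    """True if every byte of the key contains an odd number of 1 bits."""
    return all(bin(b).count("1") % 2 == 1 for b in key)

# Exactly half of all byte values have odd parity.
odd_bytes = sum(1 for b in range(256) if bin(b).count("1") % 2 == 1)
p_single = (odd_bytes / 256) ** 8    # 8-byte single DES key
p_double = (odd_bytes / 256) ** 16   # 16-byte double-length 3DES key

def conjure(rng: random.Random, length: int = 8, max_trials: int = 100_000):
    """Model of key conjuring: retry random values until parity is accepted."""
    for trials in range(1, max_trials + 1):
        candidate = bytes(rng.getrandbits(8) for _ in range(length))
        if has_odd_parity(candidate):
            return candidate, trials
    return None, max_trials

key, trials = conjure(random.Random(1))
print(p_single == 1 / 2 ** 8, p_double == 1 / 2 ** 16, trials)
```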

4 Attacks on the NSM (A Visa Security Module Clone) 

The Visa Security Module (VSM) is a cryptoprocessor with a concise, focused 
transaction set, designed to protect PIN numbers transmitted over private bank 
ATM networks, and on the inter-bank link system supported by VISA. It was 
designed in the early eighties, and the NSM is a software compatible clone [5]. 

The VSM has two authorisation states (user and authorised) enabled using 
passwords. The NSM improves on this by splitting the authorised state in two 
- supervisor and administrator, selected by two key switches on the casing. The 
user state gives access to transactions to verify customers’ PINs in a number 
of ways, and to translate them between encryption keys to allow forwarding of 
requests to and from other banks in the network. The user state also contains 
transactions to permit key generation and update for session keys. The supervisor 
state is only enabled upon special procedural controls and enables transactions 
to allow extraction of PIN numbers to a printer connected to a dedicated port on 
the cryptoprocessor. Administrator authorisation allows generation of high-level 
master keys, and is rarely used. The VSM recognises nine distinct types in total, 
shown by rectangles in figure 2. The ovals represent individual keys. 

At the top of the key hierarchy are five 3DES master keys, stored in 
registers within the cryptoprocessor. These protect the five fundamental types, and 
all other types are likewise inferred implicitly from a key’s position within the 
hierarchy. Apart from the 3DES master keys, all other keys are single DES, and 
so must be changed regularly. The PIN derivation keys are an exception to the 
regular changes, but are afforded extra protection by measures to ensure that 
known plaintext/ciphertext pairs are not available to an attacker. 




Fig. 2. The VSM key hierarchy 



Terminal Master Keys (TMKs) are copies of those used in ATMs, available 
to the VSM so that it can prepare keysets allowing the ATMs to verify PINs 
themselves. PIN keys are used to convert account numbers into PIN numbers. The 
4 digit PINs entered by customers are calculated from the result of encrypting the 
account number with the PIN key, using a publicly available algorithm. TMKs and 
PIN keys occupy the same type in the VSM, even though they are conceptually 
different. Zone Control Master Keys (ZCMKs) are keys to be shared with other 
banking networks, used to protect the exchange of working keys. Working Keys 
(WKs) are used to protect trial PINs that customers have entered, whilst they 
travel through the network on the way to the correct bank for verification, and 
are not used for intra-bank communications. Terminal Communications keys 
(TCs) are for protecting control information going to and from ATMs. Note 
that all keys sent to an ATM are protected with a TMK. Figure 3 shows the 
commands available to the normal user as lines between types. Two extra ‘types’ 
are shown: (RAND) and (CLEAR) . The (RAND) type can be thought of as a source 
of unknown random numbers, so lines emanating from it represent key generation 
transactions. (CLEAR) is a source of user chosen values. The notation TYPE_I is 
used to stand for information encrypted with a key of type TYPE. 

Fig. 3. The VSM type system 

4.1 VSM Compatibles - A Poor Type System Attack 

The amalgamation of the TMK and PIN types is responsible for a number of 
weaknesses in the VSM. One possible attack is to enter an account number as 
a TC key, and then translate this to encryption under a PIN key. The command 
responsible is designed to allow TC keys to be encrypted with a TMK for transfer 
to an ATM, but because TMKs and PIN keys share the same type, the TC can 
also be encrypted under a PIN key in the same way. This attack is very simple 
and effective, but is perhaps difficult to spot because the result of encryption 
with a PIN key is a sensitive value, and it is counterintuitive to imagine an 
encrypted value as sensitive when performing an analysis. Choosing a target 
account number ACCNO, the attack can be followed on the type transition diagram 
in figure 3, moving from (CLEAR) to TC (1), and finally to TMK_I (2). 

(1) $ACCNO \rightarrow \{ACCNO\}_{TC}$ ($ACCNO \in$ CLEAR) 

(2) $\{ACCNO\}_{TC} \rightarrow \{ACCNO\}_{TMK\_I}$ (TMK_I = a PIN key) 

Although the attack does not directly exploit any of the methods from sec- 
tion 3, it demonstrates the fragility of transaction sets, and is a good example 
of the characteristics of a broken transaction set when analysed in the context 
of key hierarchies and type systems. 

4.2 VSM Compatibles — Meet in the Middle Attack 

The meet in the middle attack can be used to compromise eight out of the nine 
types used by the VSM. The VSM does not impose limits or special authorisation 
requirements for key generation, so it is easy to populate all the types with 
large numbers of keys. Indeed, it cannot properly impose restrictions on key 
generation because of the ‘key conjuring’ attack (section 3.5) which works with 
many cryptoprocessors which store keys externally. 

The target type should be populated with at least $2^{16}$ keys, and a test vector 
encrypted under each. The dedicated ‘encrypt test vector’ command narrowly 
escapes compromising all types because the default test vector does not have the 
correct parity to be accepted as a key. Instead, the facility to input a chosen 
terminal key (CLEAR $\rightarrow$ TC in figure 3) can be used to create the test vectors. 
The final step of the attack is to perform the $2^{40}$ brute force search offline. 

The obvious types to attack are the PIN/TMK and WK types. Once a single 
PIN/TMK key has been discovered, all the rest can be translated to type TMK_I, 
encrypted under the compromised TMK. The attacker then decrypts these keys 
using a home PC. Compromise of a single Working Key (WK) allows all trial 
PINs entered by customers to be decrypted by translating them from encryption 
under their original WK to encryption under the compromised one (this command 
is shown by the looping arrow on WK_I in figure 3). 



5 Attacks on the IBM 4758 CCA 

The Common Cryptographic Architecture (CCA) is a standardised transaction 
set which is implemented by the majority of IBM’s financial security products. 
The 4758 is a PC-compatible cryptographic coprocessor which implements the 
CCA. Control over the transaction set is quite flexible: role-based access control 
is available, and the users communicate via trusted paths protected with 3DES 
session keys. The transaction set itself is large and complex, with all the typical 
transactions described in section 2.1, as well as many specialised commands to 
support financial PIN processing. The CCA stores nearly all keys in encrypted 
form outside the cryptoprocessor, with a single 168-bit master key KM at the 
root of its key hierarchy: 

The CCA holds type information on keys using control vectors. A control 
vector is synonymous with a type, and is bound to encrypted keys by XORing 
the control vector with the key used to encrypt, and including an unprotected 
copy for reference. 

5.1 4758 CCA Key Import Attack 

One of the simplest attacks on the 4758 is to perform an unauthorised type cast 
using IBM’s ‘pre-exclusive-or’ type casting method [1]. A typical case would 
be to import a PIN derivation key as a data key, so standard data ciphering 
commands could be used to calculate PIN numbers, or to import a KEK as 
a DATA key, to allow eavesdropping on future transmissions. The Key_Import 
command requires a KEK with permission to import (an IMPORTER), and the 
encrypted key to import. The attacker must have the necessary authorisation 
in his access control list to import to the destination type, but the original key 
can have any type. Nevertheless, with this attack, all information shared by 
another cryptoprocessor is open to abuse. More subtle type changes are worthy 
of mention, such as re-typing the right half of a 3DES key as a left half. 

A related key set must first be generated (1). The ‘Key_Part_Import’ 
command acts to XOR together a chosen value with an encrypted key. If a dual 
control policy prevents the attacker from access to an initial key part, one can 
always be conjured (section 3.5). The chosen difference between keys is set to the 
difference between the existing and desired control vectors. Normal use of the 
‘Key_Import’ command would import KEY as having the old_CV control vector. 
However, the identity $(KEK1 \oplus old\_CV) = (KEK2 \oplus new\_CV)$ means that 
claiming that KEY was protected with KEK2, and having type new_CV, will cause the 
cryptoprocessor to retrieve KEY correctly (3), but bind in the new type new_CV. 

Related Key Set (1) $KEK1 = KORIG$ 

$KEK2 = KORIG \oplus (old\_CV \oplus new\_CV)$ 

Received Key (2) $E_{KEK1 \oplus old\_CV}(KEY),\ old\_CV$ 

Import Process (3) $D_{KEK2 \oplus new\_CV}(E_{KEK1 \oplus old\_CV}(KEY)) = KEY$ 
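
Steps (1) to (3) can be traced in a minimal Python sketch. It is a simplified model, not the 4758 implementation: the cipher is a hypothetical toy Feistel construction rather than DES, the control vector values are invented, and `key_import` only unwraps the key and reports the type it would be bound to, omitting the re-encryption under the master key.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _f(key: bytes, half: bytes, rnd: int) -> bytes:
    # Round function of the toy cipher (assumption: a stand-in, NOT DES).
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:4]

def toy_encrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in range(4):
        left, right = right, xor(left, _f(key, right, rnd))
    return left + right

def toy_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in reversed(range(4)):
        left, right = xor(right, _f(key, left, rnd)), left
    return left + right

def key_import(kek: bytes, claimed_cv: bytes, enc_key: bytes):
    """Toy import: unwrap under KEK xor claimed CV, bind to the claimed type."""
    return toy_decrypt(xor(kek, claimed_cv), enc_key), claimed_cv

KORIG = bytes.fromhex("a1b2c3d4e5f60718")    # hypothetical shared KEK
OLD_CV = bytes.fromhex("0000000000002200")   # hypothetical 'PIN key' type
NEW_CV = bytes.fromhex("0000000000000100")   # hypothetical 'data key' type
KEY = bytes.fromhex("1122334455667788")

KEK1 = KORIG                                   # (1) related key set
KEK2 = xor(KORIG, xor(OLD_CV, NEW_CV))
received = toy_encrypt(xor(KEK1, OLD_CV), KEY) # (2) as sent by the other party

recovered, bound_type = key_import(KEK2, NEW_CV, received)   # (3)
print(recovered == KEY, bound_type == NEW_CV)
```

The import succeeds because `KEK2 xor NEW_CV` and `KEK1 xor OLD_CV` are the same byte string, so the device cannot tell the claimed type from the original one.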

A successful attack requires circumvention of the bank’s procedural controls, 
and the attacker’s ability to tamper with his own key part. IBM’s advice is 
to take measures to prevent an attacker obtaining the necessary related keys. 
Optimal configuration of the access control system can indeed avoid the attack, 
but the onus is on banks to have tight procedural controls over key part assembly, 
with no detail in the manual as to what these controls should be. The manual 
will be fixed [4], but continuing to use XOR will make creating related key sets 
very easy. A long-term solution is to change the control vector binding method 
to have a one-way property, such that the required key difference to change 
between types cannot be calculated - keys and their type information cannot be 
unbound. 

5.2 4758 CCA Import/Export Loop Attack 

The limitation of the key import attack described in 5.1 is that only keys sent 
from other cryptoprocessors are at risk from the attack, because these are the 
only ones that can be imported. The ‘Import/Export Loop’ attack builds upon 
the Key Import attack by demonstrating how to export keys from the crypto- 
processor, so their types can be converted as they are re-imported. 

The simplest Import/Export loop would have the same key present as both 
an importer and an exporter. However, in order to achieve the type conversion, 
there must be a difference of $(old\_CV \oplus new\_CV)$ between the two keys. 
Generate a related key set (1), starting from a conjured key part if necessary. Now 
conjure a new key part KEKP, by repeated trial of key imports using IMPORTER1, 
and claiming type importer_CV, resulting in (2). Now import with IMPORTER2, 
claiming type exporter_CV; the type changes on import as before (3). 

(1) $IMPORTER1 = RAND$ 

$IMPORTER2 = RAND \oplus (importer\_CV \oplus exporter\_CV)$ 

(2) $E_{IMPORTER1 \oplus importer\_CV}(KEKP)$ 

(3) $D_{IMPORTER2 \oplus exporter\_CV}(E_{IMPORTER1 \oplus importer\_CV}(KEKP)) = KEKP$ 

(4) $EXPORT\_CONVERT = KEKP$ 

(5) $IMPORT\_CONVERT_1 = KEKP \oplus (source_1\_CV \oplus dest_1\_CV)$ 

$IMPORT\_CONVERT_n = KEKP \oplus (source_n\_CV \oplus dest_n\_CV)$ 

Now use Key_Part_Import to generate a related key set (5) which has chosen 
differences required for all type conversions you need to make. Any key with 
export permissions can now be exported with the exporter from the set (4), and 
re-imported as a new type using the appropriate importer key from the related 
key set (5). IBM recommends auditing for the same key being used as both importer 
and exporter [1], but this attack employs a relationship between keys known only 
to the attacker, so conventional audit fails. 

5.3 4758 CCA 3DES Key Binding Attack 

The 4758 CCA does not properly bind together the halves of its 3DES keys. 
Each half has a type associated, distinguishing between left halves, right halves, 
and single DES keys. However, for a given 3DES key, the type system does 
not specifically associate the left and right halves as members of that instance. 
The ‘meet in the middle’ technique can thus be successively applied to discover 
the halves of a 3DES key one at a time. This allows all keys to be extracted, 
including ones which do not have export permissions, so long as a known test 
vector can be encrypted. 

4758 key generation gives the option to generate replicate 3DES keys. These 
are 3DES keys with both halves having the same value. The attacker generates 
a large number of replicate keys sharing the same type as the target key. A meet 
in the middle attack is then used to discover the value of two of the replicate 
keys (a $2^{41}$ search). The halves of the two replicate keys can then be exchanged 
to make two 3DES keys with differing halves. Strangely, the 4758 type system 
permits distinction between true 3DES keys and replicate 3DES keys, but the 
manual states that this feature is not implemented, and all share the generic 
3DES key type. Now that a known 3DES key has been acquired, the conclusion 
of the attack is simple: let the key be an exporter key, and export all keys using 
it. 
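
The half exchange can be sketched in Python under a simplified storage model, which is an assumption made for illustration: each key half is encrypted separately under the master key XORed with a left-half or right-half control vector, and nothing ties the two encrypted halves of one key together. The cipher is a hypothetical toy Feistel construction rather than DES, and all key and control vector values are invented.

```python
import hashlib

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def _f(key: bytes, half: bytes, rnd: int) -> bytes:
    # Round function of the toy cipher (assumption: a stand-in, NOT DES).
    return hashlib.sha256(key + bytes([rnd]) + half).digest()[:4]

def toy_encrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in range(4):
        left, right = right, xor(left, _f(key, right, rnd))
    return left + right

def toy_decrypt(key: bytes, block: bytes) -> bytes:
    left, right = block[:4], block[4:]
    for rnd in reversed(range(4)):
        left, right = xor(right, _f(key, left, rnd)), left
    return left + right

KM = bytes.fromhex("0f1e2d3c4b5a6978")        # hypothetical master key
CV_LEFT = bytes.fromhex("0000000000004100")   # hypothetical 'left half' type
CV_RIGHT = bytes.fromhex("0000000000004200")  # hypothetical 'right half' type

def store(left: bytes, right: bytes):
    """External token: each half encrypted on its own, typed as left or right."""
    return (toy_encrypt(xor(KM, CV_LEFT), left),
            toy_encrypt(xor(KM, CV_RIGHT), right))

def load(token):
    """What the device recovers internally when handed a token."""
    enc_left, enc_right = token
    return (toy_decrypt(xor(KM, CV_LEFT), enc_left),
            toy_decrypt(xor(KM, CV_RIGHT), enc_right))

a = bytes.fromhex("1111111111111111")   # value found by the offline search
b = bytes.fromhex("2222222222222222")   # second value found by the search
replicate_a = store(a, a)
replicate_b = store(b, b)

# The attacker splices ciphertext halves without ever decrypting them.
spliced = (replicate_a[0], replicate_b[1])
print(load(spliced) == (a, b))
```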

If the attacker does not have the permissions to make replicate keys, he must 
generate single length DES keys, and change their left half control vector to ‘left 
half of a 3DES key’. This type casting can be achieved using the Key Import 
attack (section 5.1). If the value of the imported key cannot be found beforehand, 
$2^{16}$ keys should be imported as ‘single DES data keys’, used to encrypt a test 
vector, and an offline $2^{40}$ search should find one. Re-import the unknown key as 
a ‘left half of a 3DES key’. Generate $2^{16}$ 3DES keys, and swap in the known left 
half with all of them. A $2^{40}$ search should yield one of them, thus giving you a 
known 3DES key. 

If the attacker cannot easily encrypt a known test pattern under the target 
key type (as is usually the case for KEKs), he must bootstrap upwards by first 
discovering a 3DES key of a type under which he has permissions to encrypt a 
known test vector. This can then be used as the test vector for the higher level 
key, using a Key_Export to perform the encryption. 

A given non-exportable key can also be extracted by making two new versions 
of it, one with the left half swapped for a known key, and likewise for the right 
half. A $2^{56}$ search would yield the key (looking for both versions in the same 
pass through the key space). A distributed effort or special hardware would be 
required to get results within a few days, but such a key would be a valuable 
long term key, justifying the expense. A brute force effort in software would 
be capable of searching for all non-exportable keys in the same pass, further 
justifying the expense. 

6 Conclusions 

The cryptoprocessors examined have disappointing dependency upon tight pro- 
cedural controls in the operating environment - they have failed to realise the 
full potential of tamper-resistant enclosure. It is strange that the transaction sets 
of both simple, highly-specialised cryptoprocessors and flexible, complex 
cryptoprocessors have been found vulnerable to an individual corrupt insider. 
Perhaps this is because in security the design rule ‘keep it simple’ collides with 
the need for explicitness. The complex systems fail to keep it simple, and the 
simple ones simplify too severely. The design heuristics presented below may go 
against the grain of the ‘keep it simple’ or ‘be explicit’ principles individually, 
but the best solution has to be a compromise. In the best case these heuristics 
go a long way to avoiding security pitfalls, and in the worst case, the heuristics 
at least reveal the areas in which compromises must be made. 

6.1 Design Heuristics 

- Known or chosen keys should not be allowed into the key hierarchy. 

- Avoid related key sets. If you must have them, keep the relationship secret to 
the cryptoprocessor, or generate them dynamically from a single key. 

- Ensure there is a trusted path to the cryptoprocessor available for the issue 
of sensitive commands. 

- Do not rely on key parity bits for integrity checking: the chance of accidental 
success is too high. 

- Do not allow transactions to produce ‘garbage’: results with no clearly 
defined meaning when the inputs are invalid. This frustrates analysis. 

- Keep access control as fine-grained as possible: highly flexible transactions are 
dangerous without highly flexible access control for them. 

- Avoid types whose roles cross hierarchical boundaries. 

- If using encryption with short key lengths, limit membership levels of types 
to avoid the meet in the middle attack, or prevent test vector generation. 

- Impose restrictions on key generation to limit the attacker’s options. 

- Ensure that keys are ‘atomic’: permitting manipulation of key parts is 
dangerous. 

- Be explicit when generating your type system. 

- Don’t try to infer type information from a random number: ambiguity is 
inevitable. 



6.2 Future Directions 

The VSM and CCA architectures have been shown to be unsatisfactory, and a 
skeletal toolkit has been presented for analysing these shortcomings. Research 
awaiting publication includes the application of the new attack techniques to 
more transaction sets, and future research includes the enlargement of the anal- 
ysis toolkit, and the long-term aim of designing a transaction set which is re- 
sistant to these modes of failure, and is well balanced between simplicity and 
explicitness. 

Abstract. Cryptographic and physical leakage attacks on devices and 
systems which implement cryptosystems are an area of much recent 
activity. One type of attack is the so-called kleptographic attack, which is 
mounted against black-box cryptosystems. Such attacks are issued by, and 
serve solely, the designer/manufacturer, giving it a unique advantage. 
Kleptographic attacks are capable of leaking the private keys of users securely 
and subliminally to the manufacturer of the black-box system, based on 
the availability of public values such as keys (produced when the 
system is initiated) or signature/ciphertext values (produced by systems in 
operation). These attacks provide a very high level of security against 
reverse-engineering, since even if the black-box is successfully 
reverse-engineered, no information can be obtained that compromises the secrets 
of the users (thus, the unique advantage of the attacker is retained). 
Numerous open questions remain in the area. One issue is that the only 
key generation procedure with a known attack is the RSA/factoring based 
PKC, while for Discrete Logarithm based keys attacks are not known. 
Similarly open is the existence of bandwidth-optimal leakage attacks, 
namely attacks on a “single signature” in Discrete Logarithm based 
signatures (both in the full group and prime order sub-group cases). 

In this paper, we solve the above open questions. We develop new 
attack techniques which, unlike earlier attacks, require only one value in 
order to leak the secret. This gives an attack on modular exponentiation 
keys. We then show how to implement an attack on ElGamal signature 
which leaks the private key in each signature, and which requires only 
160 bits of smoothness in $p - 1$, where $p$ is the common ElGamal prime. 
The attack utilizes the Newton channel. This channel, however, does 
not extend to DSA, since DSA operates in a prime order subgroup of 
$Z_p$. In the second part of this work, we nevertheless show a subliminal 
channel attack on DSA that assumes the existence of a small amount 
of non-volatile memory in the device. This gives a kleptographic attack 
against DSA that leaks the private key in each signature as well. 
Non-volatility is only needed to assure the polynomial indistinguishability of 
the outputs of the devices under attack from those of a normal device. 
We investigate our non-volatility assumption against hardware 
feasibility (in quite popular EEPROM devices, used in the manufacturing 
of smart-cards). 

Key words: Leakage attacks, subliminal channels, the Newton channel, 
design methodologies for asymmetric ciphers, kleptographic attacks, 
attack bandwidth, discrete logarithm based systems, ElGamal, DSA, 
tamper-proof hardware designs, trust, public scrutiny, non-volatile memory, 
hardware technologies: EEPROM, ferroelectric. 



1 Introduction 

This paper is concerned with attacks by the designer/manufacturer of black-box 
cryptographic systems. The goal is thus to expose non-trivial leakage attacks 
which are possible in black-box cipher designs (where the implementation de- 
sign is not scrutinized, as in tamper-proof hardware devices) but which are not 
necessarily possible in public designs. Indeed, black box usage of cryptography 
has been encouraged by governments. The US government, for example, has pro- 
posed the use of Capstone, a general purpose crypto-processor which grew out 
of the Clipper Initiative. Furthermore, since the mid 80’s the NSA’s Commercial 
COMSEC Endorsement Program has been active in trying to base cryptography 
on government designed tamper-proof devices (see [Sch], page 598). In addition, 
DSA is often included in smart card tokens, which constitutes a black-box en- 
vironment. DSA was proposed as the Digital Signature Standard in the US 
[DSS91]. The motivation of this paper is to investigate the possibility of (till 
now unknown) attacks. These attacks involve the design of black-box discrete log 
based systems with sophisticated trapdoors that are bandwidth-optimal (i.e., 
leak based on availability of a single value), are hard to detect and immune to 
reverse-engineering, and at the same time are indistinguishable from the attack- 
free systems. 

Note that it is easy to mount certain attacks on black-box devices, e.g., by 
fixing (or otherwise specifying) the randomness they use. However, when dealing 
with black box environments, the risks posed by the real possibility of successful 
reverse-engineering may outweigh the benefits of this attack when viewed from
the perspective of the malicious designer (being a malicious government, say). 
For example, if the “random” DSA exponent k is chosen pseudorandomly based 
on a secret seed rather than being chosen truly at random, then the reverse- 
engineer will gain as much knowledge as the designer if the seed is learned 
via reverse-engineering. Furthermore, even if the seed is chosen randomly for 
each chip, the company responsible for the programming of the chip learns the 
seeds. On the other hand, the attacks we present in this paper
are carried out in such a way that even the programmer of the chip gains no 
knowledge that will help determine the private keys of users. The programmer of 
the crypto-chip will only learn that a suspicious looking variant of DSA is being 
implemented. A related, but different, attack on signature schemes which uses 
weak pseudorandomness has been presented in [BGM97] (in contrast, we use 
strong randomness/pseudorandomness in all our attacks, but in combination
with a public key and in a different operational setting).

There has been much interest in analyzing cryptosystems with respect to 
their subliminal leakage. Gus Simmons pioneered much of the work on sublimi- 
nal channels where the leakage is universal (leaked to everyone) [Si85,Si93,Si94]. 
Recently, such leakages were suggested not only for leaking information
subliminally, but securely (privately) even if the device is later reverse-engineered
(increasing the awareness of the need of trust in the manufacturer of black-box 
devices that look like they comply with the system’s specifications). The basic
notions underlying these attacks as well as tools that accomplish them were
developed in [YY96,YY97a,YY97b]. Specifically, they introduced the notion of 
a SETUP attack where a secretly embedded trapdoor (public key) is used to 
securely leak the secret information out of the cryptosystem. Their attacks are 
geared specifically towards public key systems and exploit randomness and sub- 
liminal channels in key generation, message encryption, and signing. The number 
of leakages needed for recovery of the secret by the attacker was called the setup 
“bandwidth.” For a key generation stage which produces the key and nothing
more, an optimal-bandwidth of one value is a must. All the earlier attacks on 
discrete log based systems like the ElGamal cryptosystem, the ElGamal Digital 
Signature algorithm and the DSA (Digital Signature Algorithm) leak the private 
signing key (say) over the course of two (or more) signatures and no bandwidth 
optimal attack was known. Therefore, key generation attacks on discrete log
systems were not known either (in contrast with RSA/factoring).

The entire issue of optimal bandwidth attacks was open in discrete log based 
systems, and we solve it in this work. We first utilize the elegant Newton subliminal
channel to mount optimal-bandwidth attacks on discrete logarithm keys and El- 
Gamal signatures. We then apply a new subliminal channel attack on DSA (the 
first technique does not apply to it). Our second attack requires a limited amount 
of non-volatile memory in the computing environment. We also investigate the 
feasibility and attack life time in the required environment in realistic hardware 
devices employing EEPROM and the emerging ferroelectric technologies. 

Organization: Next we recall and present the basic definitions of the notions 
and systems we use. Section 3 presents the attack on discrete log
cryptosystems based on the Newton channel. Sections 4 and 5 then give and analyze the attack
on the DSA scheme. The conclusion is in Section 6, and the Appendix presents 
detailed hardware background explaining available implementations of some of 
our conditions in existing hardware technologies. 



2 Definitions 

Our attacks utilize the notion of what is called a SETUP attack (Secretly Em- 
bedded Trapdoor with Universal Protection). The following is the definition of 
a (regular) setup [YY97a]: 

Definition 1. Assume that C is a black-box cryptosystem with a publicly known
specification. A SETUP mechanism is an algorithmic modification made to C to
get C' such that:

1. The input of C' agrees with the public specifications of the input of C.

2. C' computes efficiently using the attacker’s public encryption function E
(and possibly other functions as well), contained within C'.

3. The attacker’s private decryption function D is not contained within C' and
is known only by the attacker.

4. The output of C' agrees with the public specifications of the output of C.
At the same time, it contains published bits (of the user’s secret key) which
are easily derivable by the attacker (the output can be generated during
key-generation or during system operation like message sending).

5. Furthermore, the output of C and C' are polynomially indistinguishable to
everyone except the attacker.

6. After the discovery of the specifics of the setup algorithm and after
discovering its presence in the implementation (e.g., reverse-engineering of
hardware tamper-proof device), users (except the attacker) cannot determine
past (or future) keys.

Observe that the above definition does not quantify the number of invocations
of C' for which the SETUP attack is carried out. Hence, it is implicitly assumed
that for all invocations of C', the output contains the published bits of the
user’s secret key. In this work we change this quantification from being
unbounded to being polynomially bounded in some security parameter k (typically,
the same security parameter as in the underlying cryptosystem). This small
change to an explicit bound is motivated merely by the reality of hardware
devices, which have a finite ability to keep/change a certain state (as a
property of the underlying technology; we will see an example later).
Informally, the attack works as follows. The attacker chooses some polynomial
poly' (in k) and implements C'. For the first poly' invocations of C', the
SETUP attack will be in effect. The user then chooses his or her own polynomial
poly, and runs C' that many times. Only if poly > poly' are there invocations
of C' for which the SETUP attack is not carried out (i.e., C' behaves honestly).
In this case the last poly - poly' invocations of C' are identical to those of
C. In our attack on DSA, we show that if poly' is large enough (which is
achievable in practice), then C' will have to be invoked polynomially many
times to reach the point at which it behaves identically to C (in the appendix
we show how this can easily take 14 years under a reasonable pace assumption, by
which time the technology is likely to become obsolete).

Below we give a definition of a SETUP attack to reflect this idea of bound- 
edness. 

Definition 2. Assume that C is a black-box cryptosystem with a publicly known
specification. A (poly)-bounded SETUP mechanism is an algorithmic modification
made to C to get C' such that it has the six properties of SETUP and in
addition it has the following property:

7. The SETUP attack is carried out a polynomial number of times in k (where k
is the security parameter of the underlying cryptosystem), after which C'
behaves identically to C.

Signature Schemes: 

The following is a review of the ElGamal digital signature algorithm [ElG85].
Let p be a large prime, and let $g \in \mathbb{Z}_p^*$ be an element with order
$p - 1$. The signing private key is $x \in_R \mathbb{Z}_{p-1}$, and the public
signature verification key is $y = g^x \bmod p$. Let H be a one-way hash
function. To sign an arbitrary message m, the following algorithm is performed:

1. $k \in_R \mathbb{Z}_{p-1}$ (with k invertible modulo $p - 1$)

2. $a = g^k \bmod p$

3. $b = k^{-1}(H(m) - xa) \bmod (p - 1)$

4. output the signature (a, b)

To verify (a, b) the verifier makes sure that $g^{H(m)} = y^a a^b \bmod p$.
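
The ElGamal review above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the prime 467, the base 2, the use of SHA-256 as the hash H, and all function names are our choices for the example and are not taken from the scheme's specification; the parameters are far too small to be secure.

```python
import hashlib
import math
import secrets

# Toy parameters (assumptions for illustration only; far too small to be secure).
P = 467  # small prime
G = 2    # base used for the example


def h(message: bytes) -> int:
    """Hash a message into Z_{P-1}; SHA-256 stands in for the hash H."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % (P - 1)


def keygen():
    x = secrets.randbelow(P - 2) + 1      # private key in [1, P-2]
    return x, pow(G, x, P)                # (x, y = g^x mod p)


def sign(x: int, message: bytes):
    while True:                            # k must be invertible mod p-1
        k = secrets.randbelow(P - 2) + 1
        if math.gcd(k, P - 1) == 1:
            break
    a = pow(G, k, P)                                        # a = g^k mod p
    b = (pow(k, -1, P - 1) * (h(message) - x * a)) % (P - 1)
    return a, b


def verify(y: int, message: bytes, signature) -> bool:
    a, b = signature
    if not 0 < a < P:
        return False
    # Check g^{H(m)} = y^a * a^b mod p.
    return pow(G, h(message), P) == (pow(y, a, P) * pow(a, b, P)) % P
```

A signature produced by `sign` passes `verify` under the matching public key, while changing b by one makes the check fail.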

We now review the Digital Signature Algorithm (DSA). Let p be a large prime
number such that $q \mid p - 1$, where q concretely is chosen to be a 160 bit
prime number and p is standardized to some concrete range (512-2048 bits in
length) as well. Let g be an element in $\mathbb{Z}_p^*$ with order q. The
signing private key is x where $x \in_R \mathbb{Z}_q$, and the public
verification key is $y = g^x \bmod p$. Let SHA denote the Secure Hash
Algorithm. To sign the message m, we compute:

1. $k \in_R \mathbb{Z}_q$

2. $a = (g^k \bmod p) \bmod q$

3. $b = k^{-1}(SHA(m) + xa) \bmod q$

4. output the signature (a, b)

To verify (a, b) the verifier makes sure that
$a = (g^{SHA(m)/b} y^{a/b} \bmod p) \bmod q$.
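
As a minimal sketch of the DSA review above, the following Python code signs and verifies with toy parameters. The values p = 607 and q = 101, the retry on degenerate values, and the function names are assumptions made for the example; real DSA uses the parameter sizes stated in the text.

```python
import hashlib
import secrets

# Toy parameters (assumptions; real DSA uses a 160-bit q and a 512-2048 bit p).
P = 607                      # prime with Q | P - 1  (606 = 6 * 101)
Q = 101                      # prime order of the subgroup
G = pow(2, (P - 1) // Q, P)  # element of order Q


def sha(message: bytes) -> int:
    """SHA-1 digest reduced mod q (toy-size reduction)."""
    return int.from_bytes(hashlib.sha1(message).digest(), "big") % Q


def keygen():
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def sign(x: int, message: bytes):
    while True:
        k = secrets.randbelow(Q - 1) + 1
        a = pow(G, k, P) % Q                                  # a = (g^k mod p) mod q
        b = (pow(k, -1, Q) * (sha(message) + x * a)) % Q      # b = k^{-1}(SHA(m) + xa)
        if a != 0 and b != 0:   # retry on degenerate values (likely only at toy sizes)
            return a, b


def verify(y: int, message: bytes, signature) -> bool:
    a, b = signature
    if not (0 < a < Q and 0 < b < Q):
        return False
    w = pow(b, -1, Q)
    u1 = (sha(message) * w) % Q
    u2 = (a * w) % Q
    # Check a = (g^{SHA(m)/b} * y^{a/b} mod p) mod q.
    return a == (pow(G, u1, P) * pow(y, u2, P)) % P % Q
```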

Finally, we will now introduce the underlying cryptographic assumption which 
is utilized in both attacks in this paper. This assumption is based on the Diffie- 
Hellman (DH) assumption [DH76], but adds additional hiding of the secret. Here 
g and the prime p are public, and v, v' which divide p — 1 are also public. We 
will refer to it as the Diffie-Hellman Plus Sum (DH-PS) assumption. 

Diffie-Hellman Plus Sum Assumption: Let $A = g^a \bmod p$, $B = g^b \bmod p$,
and $c = a + b \bmod v$, where $v \mid p - 1$. It is intractable to compute
$(g^{ab} \bmod p) \bmod v'$, where $v' \mid p - 1$ and $|v'| > M$ (concretely
M is 160), given A, B, and c.

Here $|v'|$ denotes the bit length of $v'$. Clearly, if we can solve DH, then we
can solve the DH-PS problem, and the DH-PS problem is randomly self-reducible (namely,
we can randomize the input instance and if the randomized new instance is 
solvable, we can translate the result to the result of the original input instance). 



3 SETUP Attacks Based on the Newton Channel 

We start by showing optimal attacks on cryptosystems that operate in all of 
$\mathbb{Z}_p^*$ (ElGamal type), rather than in a prime order subgroup. Our SETUP attack
utilizes the Newton Channel and leaks the private key in each and every
signature. The Newton Channel was given in [AVPN].

3.1 Review of the Newton Channel

Let $p = qm + 1$ be prime, and let q be prime. Furthermore, assume that m
is smooth, and that g generates $\mathbb{Z}_p^*$. For security it is assumed
that computing discrete logs in the group generated by $g^m$ is hard. Let c be
the covert message that is to be displayed. To display c in an exponentiation
mod p using base g, a value $k' \bmod (p - 1)/m$ is chosen randomly, and we
solve for k in $k = c + k'm \bmod (p - 1)$. Hence,

$k \equiv c \pmod m$

The user then publishes $r = g^k \bmod p$, as in any discrete log based
cryptosystem (such as the ElGamal digital signature scheme). The recipient, who
can be anyone who decides to recover c, can recover c as follows. The recipient
solves for z in the equation

$(g^q)^z = r^q \bmod p$

This can be done since the order of the subgroup of $\mathbb{Z}_p^*$ generated
by $g^q$ is smooth. Let B be the largest prime in m (i.e., its smoothness).
Using Pohlig-Hellman [Poh78] and Pollard’s Rho [Pol78], this requires time
$O(B^{1/2})$. It then follows that

$c = z \bmod m$
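
The broadcast channel reviewed above can be illustrated with a small Python sketch. The parameters (q = 7, m = 48, p = 337), the helper `find_generator`, and the brute-force search that stands in for Pohlig-Hellman are assumptions chosen so that the example runs instantly; they are not part of the original description.

```python
import secrets

# Toy parameters (assumptions): p = q*m + 1 with q prime and m smooth.
Q = 7
M = 48             # 2^4 * 3, smooth
P = Q * M + 1      # 337, prime


def find_generator(p, prime_factors):
    """Return the smallest generator of Z_p^*, given the prime factors of p - 1."""
    for g in range(2, p):
        if all(pow(g, (p - 1) // f, p) != 1 for f in prime_factors):
            return g
    raise ValueError("no generator found")


G = find_generator(P, [2, 3, 7])


def embed(c: int) -> int:
    """Sender: hide c (0 <= c < M) in r = g^k mod p with k = c + k'*m."""
    k_prime = secrets.randbelow((P - 1) // M)      # k' in Z_{(p-1)/m}
    k = (c + k_prime * M) % (P - 1)
    return pow(G, k, P)


def recover(r: int) -> int:
    """Anyone: solve (g^q)^z = r^q mod p for z, then c = z mod m."""
    base, target = pow(G, Q, P), pow(r, Q, P)
    acc = 1
    for z in range(M):     # brute force; Pohlig-Hellman would be used for large m
        if acc == target:
            return z % M
        acc = (acc * base) % P
    raise ValueError("no solution")
```

Since $g^q$ has order exactly m, the value z found by `recover` is k mod m, which equals the embedded c for every c below m.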

The clever Newton Channel was also modified to become narrowcast as opposed
to the broadcast channel given above. This is done by replacing q with two
different primes $q_1$ and $q_2$, and having the sender and receiver a priori
secretly share the signing private key mod $q_2$ and having the sender keep the
signing key mod $q_1$ private. This, however, requires a more specialized form
for the factorization of $p - 1$, and may result in reducing the security of
the underlying system. In the SETUP we describe below, no such a priori secret
exchange is required, and this specialized form for $p - 1$ is not needed. Thus,
the SETUP attack can be utilized securely and subliminally under the observance
of a warden, as in the case of Simmons’ original Prisoners’ Problem, without
requiring that the prisoners exchange a secret before going to prison.

3.2 Setting up a Discrete Log Based Key Generation 

It is now not hard to add a setup attack to the Newton Channel to leak the
private exponent x in $y = g^x \bmod p$ where g generates $\mathbb{Z}_p^*$ and
$p = mm'q + 1$. Here m is smooth and q is a prime which is greater than or equal
to 160 bits in length; $m'$ can be any value. The attack is mounted by computing
c = E(seed), where E is a public key encryption algorithm. E can be an elliptic
curve cryptosystem which outputs 310 bit ciphertexts. Hence, m must be 310 bits
in this case. Provided $c < m$, the value for seed is used in the attack.
Rather than choosing $k'$ randomly as in the Newton Channel, it can be chosen
by applying a pseudorandom generator to the value seed. The range of the
pseudorandom generator is $\mathbb{Z}_{m'q}$. k is computed in the same way as
in the Newton Channel. Therefore, anyone in possession of the private key
corresponding to E can recover the value for seed and reconstruct $k'$ and
hence k.

Theorem 1 Assuming there is a smooth number within the factorization of $p - 1$
which is large enough to be greater than the size of public key ciphertexts, there
exists a SETUP attack against any public exponentiation modulo p under the
Diffie-Hellman plus Sum assumption.

3.3 Adding a SETUP Mechanism on Top of the Newton Channel 

Let q be prime, and let $p = mm'q + 1$. We insist that m is smooth and even
and 160 bits in length. We do not insist on any particular form for $m'$. We
will make the simplifying assumption that q is 160 bits, though many different
configurations on the sizes of m and q are possible. We only require that
$q < m$. Let $x' \bmod q$ be the attacker’s private key, and let
$y' = g^{mm'x'} \bmod p$ be the attacker’s corresponding public key (which is
not published). Let $x \in_R \mathbb{Z}_{p-1}$ be the unwary user’s private key,
and let $y = g^x \bmod p$ be the corresponding public key. The attack aims to
securely and subliminally leak x in the sense of a SETUP to the attacker via
$r = g^k \bmod p$ (as in the Newton Channel), which is output by the device.

Assume that $y'$ has been placed in the discrete log cryptosystem device that
is to be SETUP. The attack is mounted as follows. The device chooses
$R \in_R \mathbb{Z}_q$. The device then solves for c in

$c = R + x \bmod q$

Now, unlike in the Newton Channel, $k'$ is not chosen randomly. Instead, the
device computes $k'$ pseudorandomly based on $y'$ and c. To be more specific,
the device solves for $k'$ as follows.

$k' = H((y')^R \bmod p)$

where H is a public pseudorandom function [GGM86] which uses a secret
seed that only the attacker and the device know. We assume that the range of
H is $\mathbb{Z}_{m'q}$. The device then computes $k = c + k'm \bmod (p - 1)$.
The device outputs $r = g^k \bmod p$ as in the Newton Channel. This value can,
for instance, be the first value in the pair of values which constitute an
ElGamal digital signature.

The attacker can recover x from r as follows. The attacker recovers c in
the same way as everyone can using the Newton Channel. The attacker then
computes $t = g^c y^{-1} \bmod p$. Note that $t = g^R \bmod p$ up to an exponent
that is a multiple of q, which vanishes once t is raised to the power $mm'x'$,
so that $t^{mm'x'} = (y')^R \bmod p$. The attacker then solves for $k'$ as
follows.

$k' = H(t^{mm'x'} \bmod p)$

k is then recovered by computing $k = c + k'm \bmod (p - 1)$. In most (if not
all) digital signature algorithms, such as ElGamal, knowledge of k implies
knowledge of x.
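
The device and attacker computations of this section can be traced end to end with toy numbers. The sketch below is only an illustration under assumptions of our own: the parameters (q = 7, m = 30, m' = 2, p = 421), HMAC-SHA256 standing in for the seeded pseudorandom function H, and all function names. It checks only that the attacker reconstructs the device's exponent k; deriving x from k is left to the signature equation.

```python
import hashlib
import hmac
import secrets

# Toy parameters (assumptions): p = m*m'*q + 1, m smooth and even, q < m.
Q, M, M_PRIME = 7, 30, 2
P = M * M_PRIME * Q + 1          # 421, prime
N = P - 1


def find_generator(p, prime_factors):
    """Smallest generator of Z_p^*, given the prime factors of p - 1."""
    for g in range(2, p):
        if all(pow(g, (p - 1) // f, p) != 1 for f in prime_factors):
            return g
    raise ValueError("no generator found")


G = find_generator(P, [2, 3, 5, 7])


def prf(seed: bytes, value: int) -> int:
    """H: keyed pseudorandom function with range Z_{m'q} (HMAC-SHA256 stand-in)."""
    digest = hmac.new(seed, value.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % (M_PRIME * Q)


def device_exponent(x: int, y_attacker: int, seed: bytes) -> int:
    """Contaminated device: derive k from the user's key x and the attacker's y'."""
    big_r = secrets.randbelow(Q)                      # R in Z_q
    c = (big_r + x) % Q                               # c = R + x mod q
    k_prime = prf(seed, pow(y_attacker, big_r, P))    # k' = H(y'^R mod p)
    return (c + k_prime * M) % N                      # k = c + k'm mod p-1


def attacker_recover_k(r: int, y: int, x_attacker: int, seed: bytes) -> int:
    """Attacker: recover k from r = g^k mod p using the private key x'."""
    # Newton Channel step: find k mod m by brute force in the subgroup of order m.
    cofactor = N // M
    base, target = pow(G, cofactor, P), pow(r, cofactor, P)
    c = next(z for z in range(M) if pow(base, z, P) == target)
    t = (pow(G, c, P) * pow(y, -1, P)) % P            # t = g^c * y^{-1} mod p
    k_prime = prf(seed, pow(t, M * M_PRIME * x_attacker, P))
    return (c + k_prime * M) % N
```

Because c is smaller than q and q is smaller than m, the Newton Channel step returns c itself, and the two pseudorandom function inputs coincide, so both sides compute the same k'.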



3.4 Security 

Note that the overall security of x is inherently reduced for the users of the 
system overall, due to the existing smoothness in p — 1. If we suppose in the 
worst case that (p - 1)/q is entirely smooth, then the users really only possess
private keys of the form x mod q. So, it is this private key that we will show is 
intractable to recover without x'. Hence, to show security, we must show that 
it is intractable to recover x mod q without x'. To show that it is a SETUP, 
we must show that the chosen r is polynomially indistinguishable from normally 
constructed (chosen) values r given x mod q and not given the secret seed to H. 

Claim 1 It is intractable to recover $x \bmod q$ without $x'$ given r, $y'$,
$x \bmod m'$, and the secret seed to H, assuming that the Diffie-Hellman plus
Sum assumption holds.

Proof. Since $p - 1$ has the requisite amount of smoothness, c can be computed
efficiently from r by anyone. Thus, we know k iff we know $k'$, because k is the
Chinese Remainder of c and $k'$. Since k is the “randomly chosen” secret exponent
used in the signature which is output (i.e., the secret exponent used to
construct the ElGamal signature pair), k is known iff x is known. It follows
that $k'$ is known iff $x \bmod q$ is known, since we are given $x \bmod m'$.
From c we then compute $t = g^R = g^c y^{-1} \bmod p$. Now, in the absence of
the application of H in the attack, we know $k'$ iff we can solve the DH-PS
problem with $v = q$ and $v' = (p - 1)/q$, since we know $t = g^R \bmod p$,
$y' = g^{mm'x'} \bmod p$, and $c = R + x \bmod q$. Thus, adding the use of H in
no way helps in recovering x without $x'$. Hence, for secrecy of $x \bmod q$,
knowledge of the secret seed to H is superfluous. QED.

If $x \bmod m'$ is not given away, then the security of the system still holds.
It follows that even if the device is reverse-engineered at a later time and $y'$
is found, this does not help the reverse-engineer figure out x. Hence, property
(6) in the definition of a SETUP holds. We note that we had originally tried
to reduce the security of this system to that of DSA itself. The idea was to
make c the DSA “signature equation”, which has the same effect as above: it
hides $x \bmod q$ using a randomly chosen value mod q. The problem is that all
signature equations include a variable which is a commitment of the randomly
chosen signature exponent, which in the case of our attack, hasn’t even been
computed yet (we must Chinese Remainder $k'$ with c to get it).

Claim 2 The random variables k in the attack and k as computed normally are
polynomially indistinguishable given x and $y'$, but not given the secret seed
to H.

Proof. Everyone knows c for each signature. Since c and x are known, R can be
found from the equation $c = R + x \bmod q$. Once R is obtained, the quantity
$((y')^R \bmod p) \bmod (p - 1)/q$ can be computed. Recall that in a given
signature which has been set up, this is the preimage under H of $k'$. However,
without knowing the secret seed to the pseudorandom function H, it follows from
the definition of a pseudorandom function that the output of H (in this case
$k'$) is indistinguishable from randomness (by standard arguments a
distinguisher for the function can be constructed from a distinguisher for the
random variables). Since c is truly random, and since $k'$ is pseudorandom, when
they are Chinese Remaindered, the resulting k is pseudorandom and the same
argument follows. QED.

If the exponents used to compute the signatures in each case are
indistinguishable (as random variables), then the presence of the attack in the
device is also undetectable, since from the perspective of a user who knows his
own private key, k is the only information conveyed to the user by the device
(it can be recovered using the user’s own x). It follows that property (5) of
being a SETUP is satisfied. Note that if each device is given a unique secret seed, then
reverse engineering one device does not help in determining whether another 
device is contaminated (in other words: under attack) or not. Also, whether or 
not this seed is known, forward security of x holds due to the use of DH-PS in 
the attack. Properties 1 through 4 of a SETUP hold for this system. So, we have 
therefore shown the following. 

Theorem 2 Assuming at least M (M = 160) bits of smoothness in $p - 1$, there
exists a SETUP attack against ElGamal (and its variants) that leaks the private
key in each digital signature, assuming the security of Diffie-Hellman plus Sum.

4 Attack on Subgroup Based Signature Schemes 

It was observed in [AVPN] that DSA does not support the Newton Channel. This 
is because all of the users of the system use a value g which generates a prime 
order subgroup of $\mathbb{Z}_p^*$, whereas the existence of the Newton Channel requires g
whose order has some smoothness. The question therefore remains whether or
not an optimal bandwidth SETUP attack exists against DSA and its variants
(e.g., Nyberg-Rueppel [NR94]). In this section we answer this question in the
affirmative. 

The attack below relies on two specific realistic conditions. First, we assume 
that each specific cryptographic black-box contains within it a unique private 
random identifier string. This requirement can be practically met using a keyed 
hash function during the programming of the crypto device (knowing the key for 
the hash function does not compromise the DSA private keys of users, however). 
Note that each Capstone chip has a unique device identifier (which may differ
from another identifier stored in Capstone’s PROM).

The second requirement is that each black-box device has poly-sized
non-volatile memory which can be read and written to. This can be realized using
Electrically Erasable Programmable Read-Only Memory (E²PROM), for instance.

4.1 System Setup 

To mount the attack, the designer generates a private key $x' \in_R \mathbb{Z}_q$ and places
the corresponding public key $y' = g^{x'} \bmod p$ in the device. A portion of the
non-volatile memory in the device will be used to store a counter cnt which is 
initially zero. This counter will be incremented by one for each signature that is 
output by the device. Let MAX denote the maximum value of cnt. Also, let ID 
denote the unique (cryptographically secure) identifier for the black-box. Hence, 
each device initially contains the triple (y', cnt, ID) where ID varies from device 
to device. 

4.2 Signing and Verifying 

To sign the message m, the device does not choose the DSA exponent k ran- 
domly, but rather chooses it pseudorandomly. Here x,y,g,p,q are as in DSA. 
The following is the SETUP version of the DSA signing algorithm: 

1. read cnt from non-volatile memory

2. if cnt > MAX then

3.    $k \in_R \mathbb{Z}_q$

4. else

5.    $B = H(ID, cnt)$

6.    $k' = B - x \bmod q$

7.    $k = ((y')^{k'} \bmod p) \bmod q$

8.    write cnt = cnt + 1 to non-volatile memory

9. $a = (g^k \bmod p) \bmod q$

10. $b = k^{-1}(SHA(m) + xa) \bmod q$

11. output (a, b) as the signature on m

Here H is a publicly specified pseudorandom function, and ID is used as the 
secret seed to it. We assume that the range of H is Zq. The signature is verified 
as in normal DSA. The intuition behind this attack is that B, which would 
typically be displayed through a subliminal channel, is in fact not displayed at 
all since it is already known to the malicious designer. 
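
The signing procedure above can be sketched in Python as follows. This is a hedged illustration, not a definitive implementation: the toy parameters (p = 607, q = 101), the bound MAX = 1000, HMAC-SHA256 standing in for H, the class name `ContaminatedDevice`, and the retry on unusable values (which matters only at toy sizes) are all assumptions of the example.

```python
import hashlib
import hmac
import secrets

# Toy DSA parameters (assumptions; far too small for real use).
P, Q = 607, 101
G = pow(2, (P - 1) // Q, P)      # element of order Q
MAX = 1000                       # hypothetical counter bound


def sha(message: bytes) -> int:
    return int.from_bytes(hashlib.sha1(message).digest(), "big") % Q


def prf(device_id: bytes, cnt: int) -> int:
    """H(ID, cnt): HMAC-SHA256 stand-in with range Z_q, keyed by the secret ID."""
    digest = hmac.new(device_id, cnt.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % Q


class ContaminatedDevice:
    def __init__(self, y_attacker: int, device_id: bytes):
        self.y_attacker = y_attacker            # attacker's public key y'
        self.device_id = device_id              # secret identifier ID
        self.cnt = 0                            # models the non-volatile counter
        self.x = secrets.randbelow(Q - 1) + 1   # user's private key
        self.y = pow(G, self.x, P)

    def _next_k(self) -> int:
        if self.cnt > MAX:                      # attack exhausted: honest DSA
            return secrets.randbelow(Q - 1) + 1
        b_val = prf(self.device_id, self.cnt)           # B = H(ID, cnt)
        k_prime = (b_val - self.x) % Q                  # k' = B - x mod q
        k = pow(self.y_attacker, k_prime, P) % Q        # k = (y'^k' mod p) mod q
        self.cnt += 1                                   # write cnt + 1
        return k

    def sign(self, message: bytes):
        while True:
            k = self._next_k()
            if k == 0:
                continue      # toy-size guard (assumption): skip an unusable k
            a = pow(G, k, P) % Q
            b = (pow(k, -1, Q) * (sha(message) + self.x * a)) % Q
            if a != 0 and b != 0:
                return a, b


def verify(y: int, message: bytes, signature) -> bool:
    """Ordinary DSA verification; it is unaware of the SETUP."""
    a, b = signature
    if not (0 < a < Q and 0 < b < Q):
        return False
    w = pow(b, -1, Q)
    return a == (pow(G, sha(message) * w % Q, P) * pow(y, a * w % Q, P)) % P % Q
```

Signatures from the contaminated device pass the unmodified DSA verification, which is the point of the construction.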

4.3 Recovering the Signing Private Key 

Given $x'$ and the list of device IDs, the signing private key x can be
recovered from (y, m, a, b) as follows:

1. for each device identifier ID do:

2.    for i = 0 to MAX do:

3.       $B = H(ID, i)$

4.       $k = ((g^B y^{-1})^{x'} \bmod p) \bmod q$

5.       if $a = (g^k \bmod p) \bmod q$ (i.e. the DSA signature check passes) then

6.          output $x = (bk - SHA(m))a^{-1} \bmod q$ and halt with TRUE

7. halt with FALSE

The private signing key is found iff the algorithm halts with TRUE. Note that
if MAX is really large, then an incremental search algorithm may be preferred 
over the above, depending upon the number of devices in existence and on how 
much is known about the particular device that was used to compute (a, b) 
(e.g., starting from an expected number of signatures at this point of time and 
searching up and down incrementally). Note that even the programmer of the 
signing algorithm cannot recover x, since the programmer does not know x', 
only the person who generates y' knows x' (which is presumably the person who 
gave the programmer the code to burn into the chip). Note that the user can 
re-key y at any time without affecting the attack. See the appendix for details on
how the counter can be implemented using existing non-volatile semiconductor 
memory technologies. 
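
The recovery search can be sketched as follows, again with toy parameters. Everything named here is an assumption of the example: the parameters, MAX = 50, HMAC-SHA256 as H, the compact `setup_sign` helper that plays the contaminated device, and in particular the extra check of the candidate x against the public key y, which we add only because collisions in the value a are common when q is tiny.

```python
import hashlib
import hmac
import secrets

P, Q = 607, 101                  # toy DSA parameters (assumptions)
G = pow(2, (P - 1) // Q, P)      # element of order Q
MAX = 50                         # hypothetical counter bound


def sha(message: bytes) -> int:
    return int.from_bytes(hashlib.sha1(message).digest(), "big") % Q


def prf(device_id: bytes, cnt: int) -> int:
    """H(ID, cnt): HMAC-SHA256 stand-in with range Z_q."""
    digest = hmac.new(device_id, cnt.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % Q


def setup_sign(x, y_attacker, device_id, cnt, message):
    """Contaminated signing; returns (signature, updated counter)."""
    while True:
        k_prime = (prf(device_id, cnt) - x) % Q      # k' = B - x mod q
        k = pow(y_attacker, k_prime, P) % Q          # k = (y'^k' mod p) mod q
        cnt += 1
        if k == 0:
            continue                                 # toy-size guard
        a = pow(G, k, P) % Q
        b = (pow(k, -1, Q) * (sha(message) + x * a)) % Q
        if a != 0 and b != 0:
            return (a, b), cnt


def recover_private_key(x_attacker, device_ids, y, message, signature):
    """Attacker: search (ID, i) for the exponent k, then solve for x."""
    a, b = signature
    y_inv = pow(y, -1, P)
    for device_id in device_ids:
        for i in range(MAX + 1):
            t = (pow(G, prf(device_id, i), P) * y_inv) % P   # g^B * y^{-1}
            k = pow(t, x_attacker, P) % Q
            if k == 0 or a != pow(G, k, P) % Q:
                continue
            x = ((b * k - sha(message)) * pow(a, -1, Q)) % Q
            if pow(G, x, P) == y:   # extra check (assumption) against toy-size collisions
                return x
    return None
```

A single signature suffices: the attacker does not need earlier signatures from the same device, only the list of identifiers and an upper bound on the counter.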

5 Security 

Claim 3 It is intractable to recover x without $x'$ given $y'$, i, and the seed
ID to H, assuming that the Diffie-Hellman plus Sum assumption holds.

Proof. Given ID and i, the value for B is known. Thus, the value $g^{k'} \bmod p$
is known, since $g^B y^{-1} = g^{k'} \bmod p$. Now, due to the fact that k is
used as the randomly chosen DSA signature exponent for the signature being
constructed, k is known iff x is known (the signature which is output by the
device is employed). It remains to show that k (and therefore x) can be found
iff the Diffie-Hellman plus Sum assumption does not hold. Since B is
pseudorandom, $k'$ is pseudorandom, and hence finding k is exactly the
Diffie-Hellman plus Sum problem with $v = v' = q$. QED.

Thus, even the reverse-engineer who obtains ID, y' , and the secret seed to H 
cannot determine past or future private keys x. Hence, property (6) of a SETUP 
holds for this attack. 

Claim 4 The values for k which are used in the above attack are polynomially
indistinguishable from the values k chosen in normal DSA signatures, given $y'$,
x, and i, but not given the secret seed ID to H.

Proof. Since H is a privately seeded pseudorandom function using seed ID, for
the first MAX invocations of the device B is chosen pseudorandomly and is
indistinguishable from random choices. Since B is pseudorandom, it follows that
$k'$ is pseudorandom mod q. This means that k results from pseudorandomly chosen
values from $\mathbb{Z}_p^*$ which are then reduced mod q, whereas the original
k is generated similarly but with random elements in $\mathbb{Z}_p^*$. If the
spaces are polynomial-time distinguishable, by standard arguments one can
contradict the pseudorandomness of H. It follows that the values for k in the
first MAX invocations are indistinguishable from random choices. For the
remaining invocations, k is chosen as in DSA itself. QED.

Therefore, the device’s behavior is indistinguishable from that of an
uncontaminated device, even for the user who knows x, and hence property (5) of
a SETUP is met. By placing unique IDs in each device, the reverse-engineer cannot
distinguish contaminated devices from uncontaminated ones without individu- 
ally reverse-engineering them all. Properties 1-4, and 7 hold for this attack. We 
have therefore shown the following. 

Theorem 3 There exists a poly-bounded SETUP attack against DSA which 
leaks the private key of the user in each signature based on the Diffie-Hellman
plus Sum assumption in a prime order subgroup. 

We note that in the attack above, it would be possible to eliminate the
need for writable non-volatile memory if a reliable source of time is available. 
If the time counter is never reset, and it has sufficient resolution that the same 
time value is never used for more than one signature, and the attacker can 
guess the time of signing well enough that it is practical to try all possibilities, 
then the counter can be eliminated and time can be used as the input to the 
pseudorandom function. Whether a counter or a timer is more practical depends 
on the application and the setting. 



6 Conclusion 

We showed how to use the Newton Channel to implement an optimal SETUP attack
against ElGamal signatures assuming 160 bits of smoothness in p - 1, based
on the Diffie-Hellman plus Sum problem. The notion of a poly-bounded SETUP 
attack was introduced and an optimal SETUP attack on DSA was presented 
which securely and subliminally leaks the DSA private key to the implementor 
in each signature. Hence, in the attack on DSA it was shown that explicit sub- 
liminal channels are not needed at all to effectively leak private DSA keys at an 
optimal bandwidth. These results imply that a single signature can leak a secret 
securely if a manufacturer attacks a black-box implementation. A cryptographic 
assumption of perhaps independent interest is utilized.



A Appendix: Implementing the Counter

We will now describe ways of implementing the non-volatile counter using exist- 
ing technologies, taking care to observe the precise physical operating limitations 
of these technologies. In particular we describe how to implement the counter 
on the ST19SF64 chip from ST Microelectronics. We conclude with a description
of how the counter can be greatly simplified using emerging (ferroelectric)
technologies.

A.1 Background Information on EEPROMs

The predecessor of Electrically Erasable Programmable Read Only Memory
(EEPROM) was Erasable PROM (EPROM), which requires
the use of UV light for erasure. Though EPROM memories can be rewritten a
number of times, a quartz crystal window is needed in the chip to permit erasure, 
and erasing typically requires 20 minutes or so. The first available EEPROM 
chips also allowed multiple writes, but unlike EPROMs, the memory could be 
erased electrically using around 20 volts or so. This however required a separate 
pin on the chip for the high programming voltage, whereas only 5 volts (the 
standard operating voltage) was needed for the read operation. Eventually the 
technology advanced to where only 5 volts were needed for reads and writes. 
These chips contain voltage amplifiers internally to perform the erase operation. 

It is a thesis of this paper that this 5 volt (and lower) EEPROM technol- 
ogy marked a major turning point in the level of trust that must be placed in 
the manufacturers of cryptographic processors, whether the processors contain 
non-volatile memory or not. The reason for this is that for the first time state- 
less tamper-proof microprocessors became indistinguishable from tamper-proof 
microprocessors containing EEPROM, since no crystal window is needed, and 
since the operating voltages are the same. 

EEPROMs have two major operating characteristics: durability and data 
retention. Durability refers to the number of times in which a given byte can be 
erased and written to, and data retention refers to how long a byte can reliably 
store its value after it is written to. Modern EEPROM memories typically have 
an endurance of $10^5$ erase/write cycles and a data retention value of 10 years or so. Note that
with these characteristics, in theory a byte can be used for more than 10 years, 
provided that it is not written to more than 100,000 times, and provided that 
no more than 10 years passes between each write (in many cases the retention 
has to do with the discharging of a capacitive layer in the memory cell). These 
operating characteristics imply that it is not possible to simply utilize 4 bytes of 
EEPROM as a 64 bit counter, since the lower order byte is not durable enough 
to handle that many writes. 
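
The wear argument can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes a naive binary counter in which a byte is rewritten only when its value changes, and the function names and the default of four bytes are our choices for the example.

```python
ENDURANCE = 100_000   # erase/write cycles per byte, from the figures above


def writes_per_byte(increments: int, num_bytes: int = 4):
    """Writes seen by each byte (low-order first) of a naive binary counter.

    Assumption: a byte is rewritten only when its value changes, so byte j
    is written once every 256**j increments.
    """
    return [increments // (256 ** j) for j in range(num_bytes)]


def first_worn_out_byte(increments: int, num_bytes: int = 4):
    """Index of the first byte whose write count exceeds ENDURANCE, else None."""
    for j, writes in enumerate(writes_per_byte(increments, num_bytes)):
        if writes > ENDURANCE:
            return j
    return None
```

After 100,000 signatures the low-order byte has already used its full write budget while the next byte has been written only 390 times, which is why the writes have to be spread over many cells.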

The reasons for these limitations have to do with the device physics of modern 
EEPROMs. We will now briefly summarize why these limitations exist. Modern 
EEPROMs are based on a stored charge concept in which the presence or absence 
of a stored charge in a MOSFET transistor (typically in a “floating” gate which 
is isolated by SiO2) affects the flow of electrons from the drain to the source
leads. Another technology (SNOS) utilizes charge trapping material instead of
a floating gate. The presence or absence of a significant amount of current
between these leads indicates a binary 0 or 1, and this current is controlled by
the electric field given off by the stored charge if charge is present. A number of
methods are used to inject and remove the stored electrons in the transistor, the 
most prominent being quantum mechanical tunneling, and hot electron injection. 

The factors affecting endurance are tunnel oxide breakdown, gate oxide break- 
down, and trap-up. The first two cause short circuits in the device, thus rendering 
it unable to store charge. Trap-up refers to electrons being trapped in the 
tunneling insulator, thus weakening the injection fields and therefore not allowing 
enough charge to reach the stored charge area during programming. For floating 
gate devices, there is no intrinsic retention problem (retention is limited only by 
device defects). Clearly, full testing is not possible since the tests would be 
concluded long after the competitive lifetime of the chip. High temperature tests are 
thus performed, and most retention failures are actually endurance failures. The 
retention characteristics are different for charge trapping devices, though these 
are typically only used in military applications which require reliable operation 
in radioactive environments (e.g., to ensure reliable missile guidance systems in 
fallout). 

A.2 An Attack Using the ST19SF64 

Below are the specifications for the ST19SF64 CMOS Smartcard MCU chip by 
ST Microelectronics, with 64k EEPROM, 32k user ROM, and 960 bytes RAM. 

Byte write time = 1 millisecond 
Data retention = 10 years 
Automatic write operation with internal control timer 
Vcc = 5 volts (or 3 volts) 
Endurance = 100,000 erase/write cycles 

Note that the read operation requires on the order of nanoseconds. 
Typically, cryptoprocessors utilize a portion of EEPROM for the cryptocode, 
so we will assume in our attack that 32k of EEPROM is reserved for crypto- 
graphic operations and that 32k are available to implement the counter cnt. 
We assume that the secret cryptographic device identifier ID, and the attacker’s 
public key y' are stored in the 32k of EEPROM along with the cryptographic 
code. 

A.3 Implementing the Non-volatile Counter 

If we were to utilize, say 4 bytes of non-volatile memory for a counter cnt that 
is incremented from zero we will not get very far since only 100,000 writes can 
be made reliably. We thus need to design a counter that can exceed 100,000 
utilizing the 32k available bytes. First observe that the counter value need not 
be incremented by one, since it is simply used as an argument to a random 
oracle H. Hence, all that is needed is a polynomial number of unique values 
for cnt. Using this observation, a counter permitting $2^{30}$ different values can be 
synthesized as follows. 

The counter cnt is the entire 32k bytes. We divide the 32k byte memory 
space into 16k words, each of which is 2 bytes. Initially, every word is zero. 
To increment the counter, we read in the words from memory until the least 
significant word which is not all binary 1’s is found (if any). We then add 1 to 
this value if found. Note that each word can be incremented at most $2^{16} - 1$ 
times. It follows that each byte is written to at most $2^{16} - 1$ times (which is 
less than 100,000 as required). The counter can no longer be incremented when 
all of the bits in the counter are 1. It follows that there are $2^{14} \cdot 2^{16} = 2^{30}$ 
different possible values for cnt. This implementation of cnt requires that all 32k 
bytes of cnt be read in the computation of B using the random oracle H . Since 
reads require on the order of nanoseconds, this is still very fast in comparison 
to the time required to compute the signature. 
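
The increment rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name WearLevelCounter, the in-memory list standing in for the EEPROM area, the choice of index 0 as the least significant word, and the per-word write tally are all hypothetical.

```python
class WearLevelCounter:
    """Toy model of the counter: n_words words of word_bits bits each."""

    def __init__(self, n_words=16 * 1024, word_bits=16):
        self.max_word = (1 << word_bits) - 1   # the all-ones word value
        self.words = [0] * n_words             # stand-in for the EEPROM area
        self.writes = [0] * n_words            # writes per word (endurance check)

    def increment(self):
        """Add 1 to the first word that is not all ones; False if exhausted."""
        for i, value in enumerate(self.words):
            if value != self.max_word:
                self.words[i] = value + 1
                self.writes[i] += 1
                return True
        return False                           # every bit is 1: counter is spent

    def capacity(self):
        """Number of successful increments before exhaustion."""
        return len(self.words) * self.max_word
```

With the default parameters no word is written more than 65,535 times, which stays under the 100,000-cycle endurance figure, and the capacity is $2^{14} \cdot (2^{16} - 1)$, marginally below the rounded value $2^{30}$ used in the text.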

A.4 Operating Statistics of the Attack 

We were unable to find benchmarks for the time required to compute a DSA 
signature on a dedicated crypto-processor. So, we will cite the time required 
to compute a DSA signature in software on a SPARC II using CryptoLib. The 
time required in this case is 430 milliseconds where |p| = 768 bits [LMS]. Note 
that CryptoLib uses some of the best algorithms to do modular exponentiation, 
including Montgomery Reduction, Vector Addition Chains, and Karatsuba mul- 
tiplication. Below we give some of the characteristics of our attack in this setting: 

SPARC II CryptoLib DSA signing time = 430 ms 

time to read cnt (i.e., time to read all 32k bytes) = 4.92 ms 

time to update cnt < 4.92 + 2 = 6.92 ms 

total non-volatile memory based overhead < 11.84 ms 

number of signatures which are SETUP = $2^{30}$ 

time required to exhaust cnt = $2^{30}$ * 430 ms > 14 years 

It follows from the above that the time required to mount the attack can 
be “absorbed” by the time required to compute the DSA signature. Measures 
should be taken to ensure that the signing time is the same whether or not all 
values for cnt have been exhausted, to avoid detection of the attack. 
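
The figures listed above can be re-derived with a short calculation. The sketch below is only a plausibility check: the per-byte read time of 150 ns is an assumption chosen so that reading 32k bytes reproduces the quoted 4.92 ms, and the update cost is read here as one further scan plus two byte writes of 1 ms each.

```python
SIGN_MS = 430.0             # DSA signing time quoted in the text
COUNTER_BYTES = 32 * 1024   # size of cnt
READ_NS_PER_BYTE = 150.0    # assumption: not stated in the text
WRITE_MS_PER_BYTE = 1.0     # byte write time from the ST19SF64 figures
SIGNATURES = 2 ** 30        # number of distinct counter values

read_ms = COUNTER_BYTES * READ_NS_PER_BYTE / 1e6
update_ms = read_ms + 2 * WRITE_MS_PER_BYTE    # scan plus one 2-byte word write
overhead_ms = read_ms + update_ms

seconds_per_year = 365.25 * 24 * 3600
exhaust_years = SIGNATURES * SIGN_MS / 1000 / seconds_per_year

print(f"read {read_ms:.2f} ms, update {update_ms:.2f} ms, overhead {overhead_ms:.2f} ms")
print(f"counter exhausted after about {exhaust_years:.1f} years of continuous signing")
```

Under these assumptions the non-volatile memory overhead is below 3% of the signing time.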
Abstract. Although the possibility of attacking smart-cards by analyzing 
their electromagnetic power radiation repeatedly appears in research 
papers, all accessible references stop short of reporting conclusive 
experiments where actual cryptographic algorithms such as DES or RSA 
were successfully attacked. 

This work describes electromagnetic experiments conducted on three different 
CMOS chips, featuring different hardware protections and executing 
a DES, an alleged COMP128 and an RSA. In all cases the complete key 
material was successfully retrieved. 

Keywords: smart cards, side channel leakage, electromagnetic analysis, 
SEMA, DEMA, DPA, SPA. 



1 Introduction 

In addition to its usual complexity postulates, cryptography silently assumes 
that secrets can be physically protected in tamper-proof locations. 

All cryptographic operations are physical processes where data elements must 
be represented by physical quantities in physical structures. These physical quan- 
tities must be stored, sensed and combined by the elementary devices (gates) of 
any technology out of which we build tamper-resistant machinery. At any given 
point in the evolution of a technology, the smallest logic devices must have a 
definite physical extent, require a certain minimum time to perform their func- 
tion and dissipate a minimal switching energy when transiting from one state to 
another. 

This paper analyzes an area of recent interest, electromagnetic side-channel 
attacks, which exploit correlations between secret data and variations in power 
radiations emitted by tamper-resistant devices. 

Since any electrical current flowing through a conductor induces electromagnetic 
(EM) emanations, it seems natural to look for the same phenomenon in the 
vicinity of a semiconductor. As the power consumption of a tamper-resistant 
device varies while data are being processed, so does the EM field and one may 
legitimately expect to extract secret information from a relevant EM analysis. 

In some cases, power curves appear to convey no information: this happens 
when power does not vary or does vary but in a way seemingly uncorrelated to the 
secret data. Very much simplified, the chip’s global current consumption can be 
looked upon as a big river concentrating the sum of the small tributaries flowing 
into it. If the subcomponents’ contributions could be determined, then the small 
streams would be isolated. This is impossible by direct electrical measurement 
but should become possible by eavesdropping on local EM radiations. In contrast 
to power analysis, this requires the design of special probes and the development 
of advanced measurement methods that focus very accurately on selected points of 
the chip. 

For the sake of scientific accuracy, we would like to clarify that this paper 
does not claim the discovery of EM information leakage (which is attested by 
numerous accessible sources [1,2,3,8,9,10,12]); we rather report complete and 
conclusive experiments where secrets used by specific cryptographic algorithms 
running on eight-bit CMOS microcontrollers were thoroughly disclosed. 

Intentionally, none of the tested programs featured software counter-measures 
against power or EM attacks and in each case the EM information leakage was 
compared to the result of power attacks performed under identical experimental 
conditions. 

The rest of this work is organized as follows: in section 2 we describe the ex- 
perimental conditions under which our results were obtained. The results them- 
selves are presented, commented and compared to power leakage in section 3. 

2 Electromagnetic Analysis 

2.1 Probe Design 

Chip-scale electromagnetic analysis requires very small probes, similar in dimen- 
sion to the chip areas to be isolated. The standard layout of a smart card chip 
shows functional blocks of a few hundred microns (CPU, cryptoprocessor). This 
defines an upper bound for the probe size. 

Although this experimental study was carried out by successively trying dif- 
ferent kinds of sensors such as hard disk heads, integrated inductors and magnetic 
loops [5,7], the best EM signals were collected using simple hand-made probes. 
These are solenoids made of a coiled copper wire of outer diameters varying 
between 150 and 500 microns. An example is shown in Figure 1. 



2.2 Electrical Behavior 

An important advantage of such inductive sensors is their broad bandwidth, in other 
words, a resonance frequency which is much higher than the highest frequency 
that the analyzed chip is able to generate. The characterization of such sensors is 
a rather difficult task requiring the generation of a constant-magnitude magnetic 
field over a very broad spectral band (several tens of MHz). 

The main drawback of such probes is their very low output signal (typically 
2 to 4 mV peak to peak). Sensitivity can be enhanced at the expense of greater 
frequency selectivity, thereby resulting in some bandwidth loss. The trade-off 
was finally settled for the benefit of bandwidth as radiation spectra were 
unknown. The weak amplitude was compensated by the use of an advanced 
acquisition chain featuring a very efficient amplification stage. 

Most chips are designed in CMOS technology. Figure 2 shows a CMOS logic 
inverter. The inverter can be looked upon as a push-pull switch: a grounded 
input (in) cuts off the bottom (n-channel) transistor and turns on the top 
(p-channel) one, pulling the output (out) high. A high input does the inverse, 
pulling the output to ground. CMOS inverters are the basic building-block of all 
digital CMOS logic, the logic family that has become dominant in very large scale 
integrated circuits (VLSI). 




Fig. 2. Elementary CMOS gate. 



During a transition from 0 to 1 or vice-versa, the device’s n and p transistors 
are on for a short period of time. This results in a short current pulse from Vdd 
to Vss. This (very partially) explains why information leaks when data flips and 
why the power curve is correlated to the transition’s Hamming distance. 

This sudden current pulse causes a sudden variation of the EM field surround- 
ing the chip which can be monitored by inductive probes which are particularly 
sensitive to the related impulse [5,7]. The electromotive force across the sensor 
(Lenz’s law) relates to the variation of magnetic flux as follows: 

$$V = -\frac{d\phi}{dt}$$






K. Gandolfi, C. Mourtel, and F. Olivier 



where $V$, $\phi$ and $t$ denote the probe’s output voltage, the magnetic flux sensed by 
the probe and the time. In practice, parasitic resistors and inescapable measurement 
imprecisions require a slight correction of the probe’s output. 
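
The relation between flux and probe voltage can be illustrated numerically. The sketch below is a toy calculation under stated assumptions: the flux is taken to be a pure sinusoid, and the 1 ns sampling step, the 10 MHz frequency and the flux amplitude are hypothetical values chosen only for illustration.

```python
import math

def induced_voltage(flux, dt):
    """Central-difference estimate of V = -dphi/dt for uniformly sampled flux."""
    return [-(flux[i + 1] - flux[i - 1]) / (2 * dt) for i in range(1, len(flux) - 1)]

DT = 1e-9      # hypothetical 1 ns sampling step
FREQ = 10e6    # hypothetical 10 MHz flux oscillation
PHI0 = 1e-12   # hypothetical flux amplitude in webers

flux = [PHI0 * math.sin(2 * math.pi * FREQ * n * DT) for n in range(200)]
voltage = induced_voltage(flux, DT)

# For a sinusoidal flux the peak voltage grows linearly with frequency.
peak = max(voltage)
expected_peak = 2 * math.pi * FREQ * PHI0
```

Because the output is a time derivative, fast transitions are emphasized relative to slow drifts, which is consistent with the sharp data signatures reported for EM curves.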

Whenever a bit flips, the resulting time signal exhibits a high frequency 
damped oscillation. Acquisition was optimized to better reflect these variations 
and data dependencies. Frequency-tuned signal processing can be applied. This 
may require, approximately, a 1 GHz sampling frequency. 

Figure 3 shows a power consumption example while Figure 4 shows the 
corresponding EM signal. The monitored signal is caused by the execution of a 
transfer into accumulator instruction (TIA) applied to 00h and FFh (the specific 
instruction name was deliberately changed to keep the chip’s identity secret). 

EM curves appear to be more noisy than power curves, but feature sharper 
data signatures. Moreover, EM signals can be phase-reversed given the minus 
sign in Lenz’s law and the probe’s spatial position: the magnetic flux is inverted 
by changing the side of the source where the sensor is present. 

To reduce parasitic signals, an attempt was made to host the chip and the 
probe in a Faraday cage. This had little effect and finally proved to be 
unnecessary. Isolating an experiment from external high frequency radiations proves 
to be a nontrivial engineering exercise for even if the probe can be hosted in a 
pollution-free cage, most elements in the acquisition chain remain sensitive to 
ambient EM noise and prone to mutual (cross-talk) perturbations. 



2.3 Spatial Positioning 

To increase the chances of capturing data-dependent signals, the probe was po- 
sitioned in the neighborhood of a region that radiates while the program runs. 
Areas radiate with different intensities and various code dependencies but, ex- 
perimentally, the most active points appear to be located near the CPU, data 
buses and power supply lines. Amongst these three, the CPU seemed to be the 
most data-dependent. 

Each curve in Figure 5 is the difference between two traces: that of 00h ⊕ 00h 
and FFh ⊕ 00h. This simple experiment illustrates the information leakage of 
the exclusive-or instruction via the power consumption and EM radiation as 
measured at five different locations: ROM, EEPROM, RAM, the supply line and 
the CPU. Each area features a distinct signature either through the signal’s shape 
or magnitude. The CPU clearly stands out by radiating the most informative 
signal. 

Approximating the source as a long linear wire (or the probe as negligibly 
small), the field’s magnitude B decreases (Biot and Savart’s law) as the inverse 
of the distance r between the wire and the probe: 

$$B = \frac{\mu I}{2\pi r}$$

where I denotes the current flowing through the wire and $\mu$ the magnetic 
permeability. It is thus important to perform measurements as closely as 
possible to the chip. Since the standard thickness of a card is 800 microns, 
landing the sensor on its back sets the observation point at 400 or 500 microns 
away from the target. 

Fig. 4. EM radiation during TIA. 
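
To give a feel for the inverse-distance law above, the following sketch evaluates the long-wire formula at a few probe distances. It is an order-of-magnitude illustration only: the 1 mA current and the use of the vacuum permeability are assumptions, not measured values.

```python
import math

MU0 = 4 * math.pi * 1e-7   # vacuum permeability in T*m/A (assumed medium)

def wire_field(current_a, distance_m):
    """Field magnitude B = mu*I / (2*pi*r) of a long straight wire, in tesla."""
    return MU0 * current_a / (2 * math.pi * distance_m)

CURRENT = 1e-3   # hypothetical 1 mA switching current

# 450 um models a probe on the back of the card; smaller values a decapsulated chip.
for r_um in (450, 100, 20):
    b = wire_field(CURRENT, r_um * 1e-6)
    print(f"r = {r_um:4d} um  ->  B = {b:.2e} T")
```

Halving the distance doubles the field, which is why bringing the coil close to the chip surface matters so much for signal quality.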

This distance may sometimes appear to be prohibitive given the weakness 
of the radiated EM power and its low signal to noise ratio. However, in some 
cases the chip’s surface can be eroded by mechanical or chemical means [9,3]. 
This operation (called decapsulation) offers two important advantages: once the 
chip is bare (if still functional), the probe’s coil can be lowered so as to touch 
the passivation layer and thereby capture the highest possible field. As a side 
effect, the chip becomes optically visible and its specific blocks can be pinpointed 
more accurately. Recapsulating the chip after the attack remains possible for 
industrially-equipped attackers. 
Fig. 5. Differentials between B(00h ⊕ 00h) and B(FFh ⊕ 00h): significant spikes are 
located near t = 100 and t = 1200; other regular spikes are just clock residues. 



3 Practical Results 

We conducted practical experiments on various devices and algorithms. In this 
section three of the most significant results are presented. Interestingly, the three 
were conducted on different chips made by different manufacturers. 

One of these chips is protected by a shield and the other two feature 
randomly synthesized logic (RSL). This means that the CPU is scrambled with other 
functional and useless blocks to make specific functions difficult to identify. Such 
designs thwart physical intrusions using Focused Ion Beam test equipment (FIB). 

The attacked algorithms were respectively the alleged COMP128 (described 
in [6] and hereafter denoted ACOMP128), DES and RSA. In all cases software 
counter-measures were deliberately turned off. For a comparative study, test 
cards were calibrated with known keys. Power and EM signals were systemati- 
cally acquired simultaneously. Once conditioned, the EM signals were digitized 
and processed exactly the same way as classical power signals, using the same 
sampling frequency, digitizer and software tools. Only their physical nature dif- 
fered. 

J.-J. Quisquater and D. Samyde suggested [12] the following acronyms: DEMA 
for Differential ElectroMagnetic Analysis, by analogy to P. Kocher’s Differential 
Power Analysis (DPA) [8]. Simple ElectroMagnetic Analysis (SEMA) relates in 
a similar way to Simple Power Analysis (SPA). DA and SA will be used for 
Differential and Simple Analysis, when the leakage’s physical nature happens 
to be irrelevant. D. Naccache coined the Greek term cryptophthora to generically 
address the phenomenon of side channel leakage. 
3.1 Alleged COMP128 

In this experiment, the smart card was not decapsulated and therefore the probe 
was positioned rather approximately. No software DPA/DEMA counter-measures 
were activated. 

A DPA and a DEMA were performed simultaneously on the same batch of 
256 chosen messages and related sets of curves. Results are shown in Figure 6: 
the two attacks generated differential spikes for the same right guess. Despite 
more noisy measurements, the DEMA provided better peaks than the DPA both in 
terms of contrast and signal to noise ratio. Equivalently, the DEMA required fewer 
acquisitions than the DPA. The sign opposition in the raw signatures remained 
visible throughout the whole differential analysis process. Moreover, since wrong 
guesses provided no peaks, the experimental evidence was brought that DEMA 
could work successfully. 

3.2 DES 

Having obtained these first results, a new DEMA was attempted on another 
component. Again, no decapsulation was performed thereby preventing a very 
accurate positioning. The attacked algorithm was a DES featuring no software 
counter-measures against DPA. For DPA and DEMA, 500 acquisitions and 
messages were necessary to infer the secret key. 

While performing a DPA, it is expected that the right guess would yield the 
maximum peaks but, experimentally, strong differential peaks are often observed 
for wrong guesses. Such false alerts may even rise higher than the right spikes 
and confuse an attacker trying to make a final decision. This phenomenon stems 
from the consumption model that underlies classical DPA. Indeed, the signature 
is usually supposed to be correlated to the data’s Hamming weight. In reality 
this may not match each and every VLSI behavior as other subparts of the chip 
may also consume power in a correlated manner. 
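
The appearance of strong peaks for wrong guesses can be reproduced even in a noise-free toy model. The sketch below is an illustration under stated assumptions, not the experiment reported here: it uses a hypothetical 4-bit S-box in place of the DES S-boxes, a leakage equal to the Hamming weight of the S-box output, and a single-bit difference-of-means selection function.

```python
SBOX = [0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
        0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2]   # toy 4-bit S-box (assumption)

def hamming_weight(x):
    return bin(x).count("1")

def leak(message, key):
    """Noise-free sample: Hamming weight of the S-box output."""
    return hamming_weight(SBOX[message ^ key])

def differential(messages, samples, guess, bit=0):
    """Mean of samples predicted to have the target bit set, minus the rest."""
    ones = [s for m, s in zip(messages, samples) if (SBOX[m ^ guess] >> bit) & 1]
    zeros = [s for m, s in zip(messages, samples) if not (SBOX[m ^ guess] >> bit) & 1]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)

KEY = 0xA                          # hypothetical secret nibble
messages = list(range(16))         # every 4-bit message exactly once
samples = [leak(m, KEY) for m in messages]
peaks = {guess: differential(messages, samples, guess) for guess in range(16)}
```

In this toy setting the right guess yields a differential of 1.0, but the wrong guess KEY ^ 7 yields the same value and KEY ^ 1 yields a peak of equal size and opposite sign, so the highest peak alone does not single out the key.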

In the present case, a DPA spotted the right guess successfully but with many 
difficulties. As shown in Figure 7, there were even examples of wrong guesses 
(39) whose peaks were higher than the right one (15) in absolute value. The 
corresponding DEMA yielded correctly ordered spikes, smearing the wrong guess 
peaks and enhancing the right one. 

Compared to DPA and for a relevant pinpointed area, experiments generally 
showed that DEMA tended to reduce the dispersion of peaks to the benefit of 
the right guess. In other words, the number of wrong guesses was reduced and 
the final decision made easier. DEMA can therefore be considered, 
experimentally, at least as efficient as DPA in the absence of specific software 
counter-measures. 

3.3 Modular Exponentiation 

The third experiment concerned an RSA exponentiation performed in a decap- 
sulated smart card. The chip’s visibility allowed a very close positioning of the 
sensor and the monitoring of the most energetic part of the EM field. 
Fig. 7. DPA and DEMA DES curves for a right (15) and a wrong (39) guess 

No software EM/PA counter-measures were implemented to protect the 
exponentiation (except a constant time implementation). As shown in Figure 8 
(lower traces) the power traces did not suggest any apparent pattern that could 
have exposed the chip to a potential SPA. 

Having observed this, the target of our study became the isolation of a 
data-correlated location, which is why the chip’s surface had to be scanned. The success 
of this tedious operation was not guaranteed but a suitable point was finally 
found after several manual positioning attempts. 

Two EM signals monitored at this point are shown in Figure 8 (upper traces). 
They look less noisy than the power curves and happen to contain patterns that 
leak the key. This illustrates how complementary SEMA and SPA can be. 




Fig. 8. EM and power traces for two different exponentiations involving three bytes of 
the private key: FFA5FFh and 666666h (same message and modulus). Artificial spikes 
delimit the three-byte windows where patterns clearly appear. 



4 Conclusion and Work in Progress 

The purpose of this work was to find out if EM attacks can be implemented in 
practice; the answer is clearly positive. 

Our experiments suggest that although more noisy, EM measurements finally 
yield better differentials than power signals. DEMA’s SNR was higher than DPA’s 
SNR and the correct guess identification was easier, as there were no false alerts 
due to erroneous peaks. The third experiment is particularly instructive as it 
shows that SEMA can succeed where SPA fails. As is obvious, this shouldn’t lead to 
the fallacious conclusion that SEMA is in some manner “more powerful” than SPA: 
we have not yet encountered the opposite case (SEMA-proof, SPA-vulnerable) but 
nothing rules out a priori that it might exist. In other words, when PA or EMA 
does not suffice alone, both can be attempted simultaneously. 

EMA’s advantage is definitely its capability of exploiting local information. 
This geometrical degree of freedom is useful as it allows pinpointing the 
problematic spots that leak information. PA’s major advantage is undoubtedly the 
relative simplicity of electric measurements as opposed to EM ones. 

The manual scanning of the chip’s surface performed during this work is, of 
course, non-exhaustive. The next step in our investigation is the implementation 
of automatic cartography tools. Note that chip-spots characterized by intensive 
power radiations (e.g. clock lines) do not necessarily leak data-correlated EM 
signals. Procedures for evaluating the likelihood of data-correlated leakage are 
described in [4]. By running such tests on EM signals collected at various locations 
on a given chip, a cartography of leakage probabilities can be performed. This 
would give an immediate bird’s-eye view of the potentially problematic spots in 
each chip and allow cross-platform comparisons. 

Natural EM hardware counter-measures typically include an upper metal 
layer (contain the radiation), variable random currents flowing through an 
active grid and generating noisy fields (blur the radiation) and successive 
technology shrinks that regularly reduce the elementary transistors’ size and make 
the functional areas more compact (reduce the radiation). Particular synthesis of 
problematic functions (coding ones as {1, 0} and zeros as {0, 1}) tries to partially 
cancel the radiation. 

It is our opinion that the combination of such hardware counter-measures 
with particular software coding techniques that inherently prevent specific forms 
of leakage, provides an acceptable security-level for most commercial applica- 
tions. 
Abstract. The growing connectivity offered by constrained computing 
devices signals a critical need for public-key cryptography in such 
environments. By their nature, however, public-key systems have been 
difficult to implement in systems with limited computational power. The 
NTRU public-key cryptosystem addresses this problem by offering better 
computational performance than previous practical systems. The 
efficiency of NTRU is applied to a wide variety of constrained devices in 
this paper, including the Palm Computing Platform, Advanced RISC 
Machines ARM7TDMI, the Research in Motion pager, and finally, the 
Xilinx Virtex 1000 family of FPGAs. On each of these platforms, NTRU 
offers exceptional performance, enabling a new range of applications to 
make use of the power of public-key cryptography. 



1 Motivation 

Since their introduction in the 1970s, the development of microprocessors and 
public-key cryptosystems has been intervolved. Ever faster, cheaper, better mi- 
croprocessors have allowed the use of public-key cryptosystems in a dizzying 
array of applications. 

One of the more popular of these is the use of desktop personal computers 
to mediate the purchase of goods and services on the Internet. For years now, 
desktop computers have offered adequate performance to make the arduous cal- 
culations involved in traditional public-key cryptosystems invisible to the casual 
user. This performance has resulted in the ubiquitous deployment of crypto- 
enabled web browsers such as Microsoft’s Internet Explorer on desktop PCs. 

Conversely, the need for public-key cryptography has led microprocessor 
vendors to add functionality to their products. Desktop eCommerce led Intel to add 
random-number generation and unique IDs to the Pentium III processor. The 
need for secure authentication in GSM cellular telephone applications has re- 
sulted in 8-bit microcontrollers with custom hardware to accelerate modular 
exponentiation. 

The number of embedded systems that require cryptography is about to 
explode. Just as the ubiquitous PC networking made possible by TCP/IP led 
to public-key crypto libraries in desktop web browsers, wireless networking is 
set to offer universal connectivity to a diverse array of computing devices. From 
washing machines to cell phones, televisions to automobiles, wireless networking 
standards such as Bluetooth and IEEE 802.11b will bring networking to devices 
that previously stood alone. As we’ve seen on the desktop, with communications 
comes the need for security, and so for public-key cryptosystems. 

In contrast to their desktop-bound, powerful brethren, these embedded de- 
vices offer severely constrained computing capacity. Power, memory, and CPU 
cycles all must be judiciously conserved. 

The computational efficiency of NTRU allows implementors to build ef- 
ficient wirelessly-communicating embedded systems. Furthermore, algorithmic 
improvements introduced in [4] augment the original construction to allow for 
greater computational savings. 

In this paper, we apply these results in the context of embedded systems. 
We report on fast NTRU implementations for the Palm Computing Platform, 
the Research in Motion pager, the Advanced RISC Machines ARM7TDMI, and 
finally field-programmable gate arrays (FPGAs). 

2 The NTRU Public-Key Cryptosystem 

NTRU is a public-key cryptosystem based on the Shortest Vector Problem in 
a lattice. Lattices find application in pure and applied mathematics, computer 
science, physics, and cryptography. In particular, the SVP has been intensively 
studied for more than one hundred years for its use in these and other areas of 
mathematics and science. Theory and experimentation [2] suggest the SVP is 
difficult in lattices of very high dimension. Such instances of the SVP form the 
basis of NTRU. 



2.1 Basic Setup 

NTRU is best described using the ring of polynomials 

$$R = \mathbb{Z}[X]/(X^N - 1).$$

These are polynomials with integer coefficients 

$$a(X) = a_0 + a_1 X + a_2 X^2 + \cdots + a_{N-1} X^{N-1}$$

that are multiplied together using the extra rule $X^N = 1$. So the product 

$$c(X) = a(X) * b(X)$$

is given by 

$$c_k = a_0 b_k + a_1 b_{k-1} + \cdots + a_{N-1} b_{k+1} = \sum_{i+j \equiv k \bmod N} a_i b_j.$$

In particular, if we write $a(X)$, $b(X)$, and $c(X)$ as vectors 

$$a = [a_0, a_1, \ldots, a_{N-1}], \quad b = [b_0, b_1, \ldots, b_{N-1}], \quad c = [c_0, c_1, \ldots, c_{N-1}],$$

then $c = a * b$ is the usual discrete convolution product of two vectors. 
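
The convolution product defined above can be written directly from the index rule, with exponents wrapping around modulo N. The function below is a plain reference sketch (the name cyclic_convolve and the optional modulus argument are choices made for this example), not an optimized implementation.

```python
def cyclic_convolve(a, b, modulus=None):
    """Return c = a * b in Z[X]/(X^N - 1); reduce coefficients if modulus is given."""
    n = len(a)
    if len(b) != n:
        raise ValueError("both vectors must have length N")
    c = [0] * n
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            c[(i + j) % n] += a_i * b_j      # X^i * X^j = X^((i + j) mod N)
    if modulus is not None:
        c = [x % modulus for x in c]
    return c
```

For example, with N = 3, the product of 1 + 2X and X + X^2 is X + 3X^2 + 2X^3, which wraps to 2 + X + 3X^2, so cyclic_convolve([1, 2, 0], [0, 1, 1]) gives the vector for that polynomial.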

To quickly sum up the other relevant basic properties of NTRU: 

1. NTRU uses three public parameters $(N, p, q)$ with $\gcd(p, q) = 1$. 

2. Typical parameter sets that yield security levels similar to 1024-bit RSA 
and 4096-bit RSA respectively are $(N, p, q) = (251, 3, 128)$ and $(N, p, q) = 
(503, 3, 256)$. 

3. Coefficients of polynomials are reduced modulo p and/or modulo q. 

4. The inverse of $a(X) \bmod q$ is the polynomial $A(X) \in R$ satisfying 
$a(X) * A(X) = 1 \bmod q$. 

The inverse (if it exists) is easily computed using the Extended Euclidean 
Algorithm. Inverses are only needed for key generation. 

2.2 Key Generation 

Choose random polynomials $F, g \in R$ with small coefficients and set $f = 1 + pF$. 
Compute the polynomial 

$$h = g * f^{-1} \bmod q.$$

The public key is h and the private key is f. 

2.3 Encryption 

The plaintext m is a polynomial with coefficients taken mod p. Choose a random 
polynomial r with small coefficients. The ciphertext is 

$$e = pr * h + m \bmod q.$$



2.4 Decryption 

Compute 

$$a = e * f \bmod q,$$

choosing the coefficients of a to satisfy $A \le a_i < A + q$. The value of A is fixed 
and is determined by a simple formula depending on the other parameters. Then 
$a \bmod p$ is equal to the plaintext m. 

2.5 Why NTRU Works 

The decryption process yields the polynomial 

$$\begin{aligned}
a &= e * f \bmod q \\
  &= (pr * h + m) * f \bmod q && \text{(since } e = pr * h + m\text{)} \\
  &= pr * g + m * f \bmod q && \text{(since } h * f = g * f^{-1} * f = g\text{)}
\end{aligned}$$

The coefficients of r, g, m, and f are small, so the coefficients of 

$$pr * g + m * f$$

will lie in an interval of length less than q. Choosing the appropriate interval, 
we recover 

$$a = pr * g + m * f = pr * g + m * (1 + pF)$$

exactly, not merely modulo q. Then reduction modulo p yields $a = m \bmod p$. 
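
The whole cycle of key generation, encryption and decryption can be traced on a deliberately tiny instance. The sketch below is a toy under several assumptions and is in no way secure: N = 7, p = 3 and a prime q = 127 (instead of a power of two such as 128) are chosen so that the inverse of f can be found by Gauss-Jordan elimination modulo q, a substitute for the Extended Euclidean Algorithm mentioned in the text. All function names and the fixed polynomials F, g, r and m are hypothetical.

```python
def conv(a, b, q=None):
    """Cyclic convolution in Z[X]/(X^N - 1), optionally reduced modulo q."""
    n = len(a)
    c = [0] * n
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            c[(i + j) % n] += a_i * b_j
    return [x % q for x in c] if q else c

def center(v, modulus):
    """Lift residues to the interval centered on zero."""
    half = modulus // 2
    return [((x + half) % modulus) - half for x in v]

def invert(f, q):
    """Inverse of f modulo a PRIME q, by Gauss-Jordan on the circulant system."""
    n = len(f)
    rows = [[f[(k - j) % n] % q for j in range(n)] + [int(k == 0)] for k in range(n)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if rows[r][col]), None)
        if pivot is None:
            raise ValueError("f is not invertible modulo q")
        rows[col], rows[pivot] = rows[pivot], rows[col]
        scale = pow(rows[col][col], -1, q)       # modular inverse (Python 3.8+)
        rows[col] = [(x * scale) % q for x in rows[col]]
        for r in range(n):
            if r != col and rows[r][col]:
                factor = rows[r][col]
                rows[r] = [(x - factor * y) % q for x, y in zip(rows[r], rows[col])]
    return [row[n] for row in rows]

N, P, Q = 7, 3, 127                          # toy parameters (assumption)
F = [0, 1, 0, 0, 0, 0, 0]                    # F = X
g = [1, 0, 1, 0, 0, 0, 0]                    # g = 1 + X^2
f = [1 + P * F[0]] + [P * c for c in F[1:]]  # f = 1 + pF
h = conv(g, invert(f, Q), Q)                 # public key h = g * f^{-1} mod q

def encrypt(m, r):
    """e = p*r*h + m mod q."""
    return [(P * rh + m_i) % Q for rh, m_i in zip(conv(r, h, Q), m)]

def decrypt(e):
    a = center(conv(e, f, Q), Q)             # exact p*r*g + m*f (coefficients are small)
    return center(a, P)                      # reduce mod p, coefficients in {-1, 0, 1}

m = [1, 0, -1, 1, 0, 0, -1]                  # plaintext with coefficients mod p
r = [0, 1, 0, 0, 1, 0, 0]                    # blinding polynomial r = X + X^4
assert decrypt(encrypt(m, r)) == m
```

Here the coefficients of p r * g + m * f never exceed 7 in absolute value, far inside the interval of length q, so the centered lift recovers them exactly as the argument above requires.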
3 NTRU Algorithmic Optimizations 

3.1 Choice of p 

If the coefficients of the polynomial $pr * g + m * f$ do not lie in an interval of length 
at most q, then decryption will not work. Appropriate choices of parameters 
reduce this to a very low probability, which may be reduced even further by the 
following observation. 

The discussion above assumes that p is an integer, where we recall that p 
and q must be relatively prime. However, as noted in [4], there is no particular 
reason that p must be an integer. It could instead be a polynomial, provided the 
ideals generated by p and q are relatively prime in the ring R. Our first choice 
for such a polynomial would naturally be a binomial $X^k \pm 1$. Unfortunately, the 
elements 

$$X^k \pm 1 \quad\text{and}\quad X^N - 1 \quad\text{and}\quad 128$$

are not relatively prime in $\mathbb{Z}[X]$. 

The next natural candidate is $p = X + 2$. It is simple to verify the relative 
primality of p and q in this case: 

$$X^N - 1 = X^N + 2^N - 2^N - 1 = (X^N + 2^N) - 128 \cdot 2^{N-7} - 1.$$

Since N is odd, $X + 2$ divides $X^N + 2^N$, so 1 is a combination of $X + 2$, 128 
and $X^N - 1$. As noted in [1] and [4], operations modulo binomials are efficiently 
computed. Thus $p = X + 2$ and $q = 128$ form the basis for a very efficient 
implementation of NTRU. 

3.2 Polynomials of Low Hamming Weight 

The most time consuming part of NTRU encryption is computation of the 
product r(X) * h(X) mod q. Similarly, the most time consuming part of NTRU 
decryption is computation of the product f(X) * e(X) mod q. The polynomials 
h(X) and e(X) have coefficients that are more or less randomly distributed 
modulo q, while one normally takes r(X) and f(X) to have binary (i.e., 0 or 1) 
or ternary (i.e., -1, 0, or 1) coefficients. 

Suppose that r(X) is a binary polynomial with d ones. Then computation 
of the product r(X) * h(X) mod q requires approximately dN operations, where 
one operation is an addition and a remainder modulo q. 

A common trick (see [7] for instance) is to choose a polynomial of low Hamming 
weight. We extend this idea by taking a product of low Hamming weight 
polynomials as suggested in [5]. To this end, we write r(X) = r_1(X) * r_2(X), 
where r_1 and r_2 are binary polynomials with d_1 and d_2 ones respectively. Then 
r(X) will have approximately d_1 d_2 ones, a few twos, and rarely a three. Rather 
than computing r(X) * h(X) mod q as (r_1 * r_2) * h, it is far more efficient to 
compute it as 

r(X) * h(X) = r_1(X) * (r_2(X) * h(X)), 

which requires only (d_1 + d_2)N operations. Thus the computational complexity 
is proportional to the sum of d_1 and d_2. 

On the other hand, the search space for the pair of polynomials (r_1, r_2) has 
size approximately $\binom{N}{d_1}\binom{N}{d_2}$, so is proportional to the product of the r_1 
search space and the r_2 search space. In practice, there are meet-in-the-middle 
approaches that reduce the size of the search space; see [5] for details and security 
considerations. Further, the number of nonzero coefficients in r_1(X) r_2(X) is 
essentially the product d_1 d_2. Thus one might say that using a product r = r_1 r_2 
requires computation proportional to the sum d_1 + d_2 while giving security 
proportional to the product d_1 d_2. In rough terms, this explains why one obtains 
significant performance gains without changing the level of security. For the 
common parameters N = 251 and q = 128, one typically sets r = r_1 * r_2 + r_3, where 
each of r_1, r_2, r_3 is binary with 8 nonzero coefficients. 

Given the above, multiplication involving the private key f(X) is aided by 
writing 

f(X) = 1 + p * (f_1(X) * f_2(X) + f_3(X)). 

3.3 A Fast Convolution Algorithm in Software 

Under the assumption that the coefficients of f_1, f_2, f_3 are binary, we thus have 
an efficient algorithm for ring multiplication in software. The central idea is that 
rather than storing f or the individual f_i polynomials as N-element arrays in 
memory, it suffices to store those array offsets whose locations correspond to a 
nonzero entry. Thus, a polynomial f_i(X) = X^191 + X^178 + ... + X^14 + X^2 would 
be stored in memory as the array 191, 178, ..., 14, 2. For convenience, arrays 
representing the f_i(X) polynomials are concatenated into a single array which 
we denote b. 

Recall that the coefficients of the product of b(X) and some general polynomial 
a(X) have the form 

$c_k = a_0 b_k + a_1 b_{k-1} + \cdots + a_{N-1} b_{k+1} = \sum_{i+j \equiv k \bmod N} a_i b_j.$ 

The sparse nature of f_i(X) causes most of these inner product terms to be 
zero. So rather than employing a traditional polynomial multiplication algorithm 
that expends a great deal of effort computing zero terms, we take a different 
approach. Scanning the b array allows us to calculate only those inner product 
terms which may be non-zero. A particular non-zero coefficient will appear in N 
inner product terms. 

The algorithm begins by zero-initializing an array of coefficients that will 
hold the result c(X) = f_i(X)a(X). For each entry of the b array we calculate 
the N inner product terms corresponding to a non-zero coefficient in f_i(X). Since 
f_i(X) is binary, each non-zero inner product term is simply a coefficient of a(X). 
These terms are individually accumulated in their corresponding location in the 
c array. Repeating this process for all non-zero coefficients calculates f_i(X)a(X) 
at a cost of d_i N additions of log_2(q)-bit numbers. 

With this procedure in hand, we may compute the overall f(X)a(X) multiplication 
with the following steps: 

1. t(X) ← a(X) * f_1(X) 

2. c(X) ← t(X) * f_2(X) = a(X) * f_1(X) * f_2(X) 

3. t(X) ← f_3(X) * a(X) 

4. c(X) ← c(X) + t(X) mod q = f_3(X) * a(X) + f_1(X) * f_2(X) * a(X) 

Thus in the common practical case of N = 251 and q = 128 with each of 
f_1, f_2, f_3 having eight nonzero coefficients, the convolution is computed with 
251 × 8 × 3 = 6024 seven-bit additions and no multiplications. This algorithm is 
thus ideally suited for the low-power, low-clockrate, narrow arithmetic architectures 
found in constrained devices. 

Pseudocode for this operation is found as Algorithm 1, where all array offsets 
are to be taken modulo N for clarity in exposition. 



Algorithm 1. Fast Convolution Algorithm 
Require: b an array of d_1 + d_2 + d_3 nonzero coefficient locations representing the 
polynomial f(X) = 1 + p * (f_1(X) * f_2(X) + f_3(X)), a the array a(X) = Σ a_i X^i, N the 
number of coefficients in f(X), a(X). 

Ensure: c the array where c(X) = (f_1(X) * f_2(X) + f_3(X)) * a(X), from which 
f(X)a(X) = a(X) + p * c(X) 

for 0 ≤ j < d_1 do {Compute t(X) ← a(X) * f_1(X)} 
    for 0 ≤ k ≤ N - 1 do 
        t_{k+b_j} ← t_{k+b_j} + a_k 
    end for 
end for 

for d_1 ≤ j < d_1 + d_2 do {Compute c(X) ← t(X) * f_2(X)} 
    for 0 ≤ k ≤ N - 1 do 
        c_{k+b_j} ← c_{k+b_j} + t_k 
    end for 
end for 

for 0 ≤ k < N do {Zero out t} 
    t_k ← 0 
end for 

for d_1 + d_2 ≤ j < d_1 + d_2 + d_3 do {Compute t(X) ← f_3(X) * a(X)} 
    for 0 ≤ k ≤ N - 1 do 
        t_{k+b_j} ← t_{k+b_j} + a_k 
    end for 
end for 

for 0 ≤ k ≤ N - 1 do {c(X) ← f_3(X) * a(X) + f_1(X) * f_2(X) * a(X)} 
    c_k ← c_k + t_k mod q 
end for 
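As an illustration of Algorithm 1, the following Python sketch computes (f_1 * f_2 + f_3) * a using only additions and checks it against a dense schoolbook convolution. It is a sketch under stated assumptions rather than the reference implementation: the three offset lists are passed separately instead of as one concatenated array b, and the function names (`sparse_mul`, `fast_convolution`, `schoolbook`, `dense`) are hypothetical.

```python
def sparse_mul(a, offsets, q):
    """Cyclic product of a(X) with a binary polynomial given by its nonzero exponents."""
    N = len(a)
    out = [0] * N
    for b in offsets:                      # one pass per nonzero coefficient
        for k in range(N):                 # N additions, no multiplications
            idx = (k + b) % N              # array offsets are taken modulo N
            out[idx] = (out[idx] + a[k]) % q
    return out


def fast_convolution(a, f1, f2, f3, q):
    """Return (f1*f2 + f3) * a in Z_q[X]/(X^N - 1), following Algorithm 1."""
    t = sparse_mul(a, f1, q)               # t <- a * f1
    c = sparse_mul(t, f2, q)               # c <- t * f2
    t = sparse_mul(a, f3, q)               # t <- f3 * a
    return [(ck + tk) % q for ck, tk in zip(c, t)]


def schoolbook(x, y, q):
    """Reference cyclic convolution of two dense coefficient lists."""
    N = len(x)
    out = [0] * N
    for i in range(N):
        for j in range(N):
            out[(i + j) % N] = (out[(i + j) % N] + x[i] * y[j]) % q
    return out


def dense(offsets, N):
    """Expand a list of nonzero exponents into a 0/1 coefficient list."""
    return [1 if k in offsets else 0 for k in range(N)]


# Toy parameters (not the paper's N = 251): compare both methods.
N, q = 11, 128
a = [(7 * k + 3) % q for k in range(N)]
f1, f2, f3 = [0, 4, 9], [1, 5], [2, 3, 10]
inner = [(x + y) % q for x, y in
         zip(schoolbook(dense(f1, N), dense(f2, N), q), dense(f3, N))]
print(fast_convolution(a, f1, f2, f3, q) == schoolbook(a, inner, q))
```

The sparse version performs (d_1 + d_2 + d_3) N additions, whereas the dense reference performs N^2 multiply-accumulate steps.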



For the sake of comparison, we implemented Karatsuba-Ofman polynomial multiplication 
and Algorithm 1 on a variety of embedded systems. These results are 
found in Table 1. 

Table 1. Polynomial Multiplication Algorithm Comparison 

Operation     MC68EX328 Dragonball   Intel 80386         37 MHz ARM7 
              (20 MHz Palm Vx)       (20 MHz RIM 957) 
Karatsuba     25 msec                178 msec            12.75 msec 
Algorithm 1   3.2 msec               28 msec             1.62 msec 



4 NTRU Embedded Reference Implementation 

The NTRU Embedded Reference Implementation package is designed for use 
in applications where both high performance and small footprint are important 
considerations. The package contains the NTRU algorithm [3], the NTRU Signa- 
ture Scheme (NSS) [6], a random number generation utility, and public domain 
versions of the AES selected Rijndael symmetric cipher and the SHA-1 hash 
function. 

The software library is implemented in ANSI C and is easily ported while 
maintaining high performance. Two important design choices include the use of 
an internal memory management scheme and support for 8/16/32/64-bit 
environments. 

Internal Memory Management Scheme. Memory allocation on constrained 
devices is typically a source of inefficiency and portability problems. While 
some devices disallow the use of heap management functions (such as malloc, 
realloc, and free), others significantly restrict the use of stack space. 
Regardless of which operations are available, there is normally significant 
CPU overhead associated with native memory management functions. For 
efficiency and portability, the implementation establishes its own internal 
memory management scheme. When an application initializes the imple- 
mentation, a block of memory is created from either the stack or the heap 
and used to satisfy the application’s dynamic memory management needs. 
Thus, the implementation’s memory management is abstracted from the ap- 
plication environment, improving portability, security and performance. 
8/16/32/64-bit environments. One of the requirements of the software is to 
support many different devices. Popular microprocessors have word lengths 
ranging from 8-64 bits. To provide maximum flexibility, storage of public 
and private key information as well as intermediate results is generally in 
arrays of 8-bit types and all operations are 8 bits wide. While this provides a 
flexible approach supporting operation on different size devices, it may not 
be the most efficient approach on all devices. For example, on some devices 
a 16- or 32-bit operation has the same cycle cost as a corresponding 8-bit 
instruction. This fact can be exploited when tailoring NTRU for a specific 
platform. 



4.1 NTRU C Performance Results 

The NTRU design decisions lead to a generic software base that can be run on 
many different platforms. Outside of good software engineering practices, there 
are no platform-specific C optimization tricks used in the reference implementation. 
Even without platform-specific optimizations, the performance numbers, 
as shown in Table 2, are impressive on a variety of popular processors for constrained 
devices. In these tables, msec is taken to mean milliseconds. In addition, 
in this and all remaining sections of this paper, we report results for NTRU with 
parameters (N, p, q) = (251, X + 2, 128). 



Table 2. NTRU Performance Results 

Operation        MC68EX328 Dragonball   Intel 80386         37 MHz ARM7 
                 (20 MHz Palm Vx)       (20 MHz RIM 957) 
Key Generation   1130 msec              858 msec            80.6 msec 
Encryption       47 msec                39 msec             3.25 msec 
Decryption       89 msec                72 msec             6.75 msec 



4.2 NTRU Optimized for Palm Computing Platforms 

The Motorola Dragonball microprocessor is widely used in Palm computing plat- 
forms. While the Dragonball supports 8-, 16-, and 32-bit data operations, mem- 
ory is organized into 16-bit words. Although the NTRU fast convolution al- 
gorithm operates on 7-bit polynomial coefficients, each operand fetch actually 
retrieves a full 16-bit word. Assuming the coefficients are organized in memory 
along byte boundaries, this leads to twice as many memory accesses as should be 
needed. While an easy choice would be to read the full word and use the two 
bytes separately, the task of extracting anything but the lowest byte in a register 
is more expensive than simply fetching the next byte from memory. 

Since Algorithm 1 is nothing more than repeated coefficient addition, the 
arithmetic requirements on the Dragonball are minimal. The result is that most 
of the time is spent fetching coefficients and storing their sum. A great deal of 
optimization can be achieved simply by making these memory operations more 
efficient. Extensive use of the Dragonball’s post-increment and pre-decrement 
pointer operations makes the code much faster than using pointer offsets, the 
approach taken by the C compiler. 

Another performance-limiting factor is Algorithm 1's use of circular array 
indexing for fast modular reduction. Since the Dragonball has no native support 
for circular arrays, we can simply place two copies of a in adjacent memory loca- 
tions and reduce the burden of pointer arithmetic. Figure 1 graphically displays 
the situation. 

By taking the buffering idea one step further, we can exploit the 16-bit architecture 
of the Dragonball to perform two 7-bit coefficient additions in parallel. 
To this end, we simply pack two 7-bit coefficients into a word and add. Any 
overflow from the add operation can be removed by modular reduction via a 
logical AND with 0x7F7F. This effectively reduces each byte modulo q = 128. The 
main problem with this scheme is alignment of data on 16-bit boundaries. If the 
offset j is odd, then the above scheme with two copies of a will suffice to perform 
word additions instead of byte additions. If j is even however, none of the words 
would be aligned to perform the addition. If we make a third copy of a, again 
adjacent to the other two, we find that if j is even, a + (N - j) is unaligned, but 
a + (2N - j) will be. This is shown in Figure 2. 
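Both buffering ideas can be illustrated in a few lines of Python. This is only a sketch of the arithmetic, assuming coefficients already reduced to 7 bits; Python lists do not model the 16-bit alignment constraints that motivate the third copy of a, and the helper names (`rotated_rows`, `pack`, `packed_add`) are hypothetical.

```python
def rotated_rows(a, b):
    """Coefficients a[(k - b) mod N] for k = 0..N-1, read from a doubled buffer."""
    N = len(a)
    buf = a + a                       # two adjacent copies of a
    return buf[N - b: 2 * N - b]      # one contiguous slice, no modular indexing


def pack(lo, hi):
    """Place two 7-bit coefficients in the low and high byte of a 16-bit word."""
    return (hi << 8) | lo


def packed_add(x, y):
    """Add two packed words and reduce both bytes modulo 128 with one AND."""
    # Each byte sum is at most 254, so no carry crosses the byte boundary.
    return (x + y) & 0x7F7F


a = list(range(7))
print(rotated_rows(a, 2))                                   # a shifted by offset 2
print(hex(packed_add(pack(100, 127), pack(60, 127))))       # two additions at once
```

Because each 7-bit operand leaves the top bit of its byte clear, a single addition followed by one mask replaces two separate add-and-reduce steps.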

The current assembly improvements can be seen in Table 3. 

Fig. 1. Bytewise buffered convolution example; bj 

Fig. 2. Wordwise buffered convolution example; bj = 2, N = 7 



Table 3. NTRU Palm Assembly Language Performance Improvements 

Operation        Palm C code   Palm Assembly/C code 
Key Generation   1130 msec     630 msec 
Encryption       47 msec       33 msec 
Decryption       89 msec       60 msec 



5 NTRU in an FPGA 

Due to its low complexity and parallel nature, the NTRU cryptosystem lends 
itself extremely well to hardware implementation. The primary function in the 
encryption algorithm is the convolution of the public key h(X) by the random 
vector r(X), as described in Algorithm 1. The nature of the construction of 
r leads to the observation that with overwhelming probability, each coefficient 
of r is at most 15 (i.e., fits into at most 4 bits), with a limit on the number 
of non-zero coefficients. This allows the use of repeated coefficient addition as 
opposed to full coefficient multiplication to implement convolution. 

The encryption engine operates in the following steps. First, the operands 
h, r, and m must be loaded serially, 251 bits at a time. Once the operands 
are loaded, the engine begins bit-scanning each r coefficient. For each non-zero 
coefficient, the engine adds h to the temporary result. This is repeated a number 
of times corresponding to the value of the current r coefficient. Once this is 
complete, or if the coefficient is zero, h is rotated left by one coefficient (7 bits) 
to perform the modular reduction of the result over X^N - 1. The next r coefficient 
is then scanned, repeating the above process until all r’s have been processed. 
Finally, the engine adds the polynomial m to the result of the convolution and 
outputs this value as the encrypted message. Because of the expansive nature of 
encryption, the encrypted message is output serially. Note that h is retained in 
the encryption engine, and thus successive encryptions only require the loading 
of r and m, which takes 5 clock cycles. 
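The scan, add, and rotate loop described above can be modelled in software. The sketch below is a behavioural model only, not the VHDL design: it assumes that "rotating h by one coefficient" corresponds to multiplying h(X) by X modulo X^N - 1, and the function name `encrypt_engine` is hypothetical.

```python
def encrypt_engine(h, r, m, q=128):
    """Behavioural model of the engine: result = r * h + m in Z_q[X]/(X^N - 1)."""
    N = len(h)
    acc = [0] * N
    h = list(h)
    for ri in r:                               # scan coefficients r_0, r_1, ...
        for _ in range(ri):                    # add h once per unit of the coefficient
            acc = [(x + y) % q for x, y in zip(acc, h)]
        h = [h[-1]] + h[:-1]                   # rotate: multiply h by X modulo X^N - 1
    return [(x + y) % q for x, y in zip(acc, m)]


# Toy example with N = 5 (the paper uses N = 251).
h = [3, 100, 7, 64, 127]
r = [1, 0, 2, 0, 1]
m = [1, 0, 1, 1, 0]
print(encrypt_engine(h, r, m))
```

In hardware each inner addition updates all N coefficients in parallel, which is why small r coefficients make repeated addition cheaper than a general multiplier.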

For the provided implementation, the following tools were used: 

— Synthesis: Synplicity’s Synplify version 6.1.3. 

— Place and Route: Xilinx’s Design Manager version 2.1i_sp6. 

— Simulation: Viewlogic’s Powerview version 6.1, FusionHDL version 1.4, and 
Viewlogic’s Workview Office version 7.53, Speedwave version 6.202. 

For the provided implementation, the Xilinx Virtex 1000EFG860 FPGA was 
chosen as the target device. The package type chosen provides sufficient I/O (656 
IOBs) and logic resources to satisfy the design requirements. Further information 
regarding the Virtex E family may be found in [8]. Note that the VHDL implementation 
is fully portable to ASIC technology, since no FPGA vendor-specific 
constructs were used in the provided implementation. 



Table 4. FPGA Implementation Results 

Encryption Cycles                 259 
Clock Period                      19.975 ns 
Clock Frequency                   50.063 MHz 
Encryption Time                   5.174 µs 
Encryption Throughput             48.52 Mbps 
Slices Used                       6373 
Logic Resource Utilization        51% 
Approximate Gate Count            60,000 
Approximate Register Gate Count   40,000 
I/O Used                          506 
I/O Utilization                   77% 



6 Conclusions 

In this paper we have provided practical implementation results for NTRU run- 
ning on a number of embedded systems including microcontrollers and FPGAs. 
In addition, we have provided a new fast convolution algorithm which eliminates 
the need for explicit multiplication in encryption and decryption. 

Abstract. This paper introduces a new block cipher, and discusses its 
security. Its design is optimized for high-bandwidth applications that 
do not have high requirements on key-schedule latency. This paper also 
discusses several security issues about such an application: harddisk en- 
cryption. 

Keywords: bitslice, encryption, harddisk, mobile computing 



1 Introduction 

Today’s secret key cryptosystems are designed to be versatile enough to fit most 
usages in a wide variety of environments. The recently chosen new American 
standard for symmetric encryption, the AES [1], is a perfect illustration of that 
fact: apart from its seemingly good security, it was chosen because it could run 
reasonably fast on a modern workstation, a low-end personal digital assistant, 
a smartcard or a specific ASIC. This speed and ease of implementation are 
requirements for what the AES was designed to be: a standard; interoperability 
issues imply that all applications must use the same algorithm, so it must be 
good everywhere. 

However, there are some applications where requirements are different: one 
of them is on-the-fly harddisk encryption. Such encryption is needed to prevent 
disclosure of important data if a harddisk is stolen, or scanned during a 
period of inactivity (some sort of lunch-time passive attack). This is especially 
important for mobile systems, such as portable computers. Another class of attacks 
worth countering is active attacks, in which an attacker modifies data on 
the disk. Even if the modification is essentially random, such tampering should 
at least be detected. 

Let us detail what is needed, and what is not: 

— We need a very fast cipher; security is not a goal in itself, but a necessary evil 
used to protect other jobs; and since modern operating systems implement 
multitasking, only a marginal proportion of the cpu power should be used 
to perform encryption. 

— We do not care about key-schedule latency: the key-schedule is performed 
only once per session, at boot time, and the cost can be further reduced, so 
that it should be doable in a time acceptable to the user (the user will not want to 
wait several minutes every morning). 

— We encrypt data blocks whose size is a multiple of 512 bytes: this is the standard size 
of a harddisk sector; all reads and writes from and to the disk are performed 
with this granularity at least (on some systems, it might be higher). 

— We need to handle random accesses to the disk, with low overhead; we cannot 
afford, for instance, extra physical reads on the disk. 

— We run on a modern computer, with many registers¹. 

Classical block ciphers are ruled out for speed reasons; as a rule of thumb, 
the maximum allowed cpu overhead should be 15%, on a 1 GHz cpu, with a 
disk running at 20 Mbytes per second. This means that a bandwidth of at least 
120 Mbytes per second (when full cpu is used) is needed. Algorithms such as 
the AES [1], Blowfish [2] or CS-Cipher [3], although considered as fast, will be 
limited to about 50 Mbytes per second. 

Stream ciphers are also out of the question, due to random access; stream 
ciphers have a state, that needs to be maintained, in order to encipher and deci- 
pher. The initial state depends on the key, and its construction usually requires 
some time. For instance, although the bandwidth obtained with RC4 is high, 
the key schedule is rather slow with regards to the production of 512 bytes of 
stream. Besides, the ciphertext is often too malleable: if the attacker guesses the 
plaintext, he can easily change that plaintext to whatever he wants. 

So we need some sort of very fast block cipher; we present such a cipher in 
this paper. We will first recall the so-called bitslice programming technique, as 
described by Eli Biham [4], then describe the algorithm itself, and discuss implementation 
and security issues. A final section will spell out some general problems 
related to harddisk encryption, and show how our cipher helps in solving them. 



2 Bitslice 



Bitslicing is an implementation trick, classical among electronics engineers, but never 
really published, and therefore rediscovered several times. Eli Biham was the first 
cryptographer to document it in [4]; the method basically boils down to an alternate 
representation of data that allows software implementations to work like 
hardware ones, with similar optimizations. Bitslice code is also called orthogonal 
code, to refer to this alternate representation. 



2.1 Abilities of Modern Processors 

Modern processors are more and more of the RISC trend; this means that they 
have many, wide registers, and are able to perform bitwise logic operations be- 
tween these registers at high speed. They are however relatively bad at handling 
byte-formatted data, such as ASCII text. 

¹ This extends to the PC, although the Intel instruction set includes only a limited 
number of addressable registers; see section 2.1. 

An emblematic processor is the Alpha [5]: it has 32 64-bit registers, all of them 
being equivalent; there is no specialized register². All calculating instructions 
take two registers as operands, and a third one as destination. This design is 
a good example of what processors will look like in the future; Intel chose 
a similar design for its new, market-leading processor, the Itanium [6], which 
should replace the long-lived Pentium family. The Alpha and the Itanium are 
native 64-bit processors. 

Actually, Pentium processors are already quite RISC-like: they certainly still handle 
the old 8080-compatible CISC code, with few registers and many complex 
instructions; but most of those complex instructions are there for backward 
compatibility only, and are slow, so compilers do not use them. Besides, 
there are many internal registers, and the processor renames, aliases and duplicates 
the registers visible to the programmer. Memory accesses to the internal 
cache memory are also made very fast, so we can consider those processors as 
being on the RISC side. The Pentium is a 32-bit processor, but already has 
some 64-bit registers, in the MMX unit. 

2.2 Orthogonalization of Data 

The natural reflex of the cryptosystem programmer, when a 32-bit value must be 
used, is to store it in a register. This approach has the following drawbacks: 

— When the registers are wider than the data, some of the computing power 
of the processor is lost. 

— Bit permutations are costly; those operations are common in cryptosystems 
since they help in creating a correct avalanche effect. But those permutations 
are mere data routing, and do not perform any real calculation. 

The orthogonal representation is the following: spread the data among many 
registers, one bit per register. You then calculate the algorithm as a circuit, with 
logic gates that map cleanly to the native bitwise operations of the processor. 
Since those operations are bitwise, they are performed on all bits of the registers 
at the same time; if you have n-bit registers, you perform n instances of the 
algorithm simultaneously. This is heavy parallelism, quite suited to situations 
where you have much data to encrypt, in ECB mode. 

There remains the problem of getting input data into the appropriate storage 
ordering; this is equivalent to the transposition of a matrix. Figure 1 illustrates 
this transposition. See [7] for an O(n log n) method of transposition of an 
n × n matrix, when n-bit logical operations and shifts are atomic. 
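The orthogonal representation and its parallelism can be shown with a toy example. The sketch below uses a naive O(n^2) bit-matrix transposition, not the O(n log n) method of [7], and a made-up four-gate circuit that is not part of any cipher; all names are hypothetical. Eight 8-bit blocks are transposed into eight slices, the circuit is evaluated once with bitwise operations, and the result is transposed back and compared with a block-by-block evaluation.

```python
def transpose(words, n):
    """Naive bit-matrix transpose: bit j of output word i is bit i of input word j."""
    out = [0] * n
    for i in range(n):
        for j in range(n):
            out[i] |= ((words[j] >> i) & 1) << j
    return out


def toy_circuit_bitsliced(s, mask):
    """s[i] holds bit i of every instance; each expression is one gate for all instances."""
    return [s[0] & s[1], s[2] | s[3],
            ~(s[4] & s[5]) & mask, ~(s[6] | s[7]) & mask] + s[4:]


def toy_circuit_scalar(x):
    """The same toy circuit applied to a single 8-bit block."""
    b = [(x >> i) & 1 for i in range(8)]
    y = [b[0] & b[1], b[2] | b[3], 1 - (b[4] & b[5]), 1 - (b[6] | b[7])] + b[4:]
    return sum(bit << i for i, bit in enumerate(y))


blocks = [0x00, 0xFF, 0xA5, 0x3C, 0x12, 0xF0, 0x0F, 0x99]
slices = transpose(blocks, 8)                        # orthogonalize
out = transpose(toy_circuit_bitsliced(slices, 0xFF), 8)
print(out == [toy_circuit_scalar(x) for x in blocks])
```

Note that any bit permutation in the circuit would only change which slice index is read, which is why permutations cost nothing in this representation.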

2.3 Applicability 

With bitslice, bit permutations are “free”: the code just has to use the right 
register. This is solved at compilation time and does not induce runtime cost. 

² There is actually one: register 31 always contains 0; but since this value does 
not change, it can be safely “duplicated” inside the processor and therefore does not 
constitute a bottleneck. 







T. Pornin 







Moreover, the ALU (Arithmetic and Logic Unit) is used at its full potential, 
since the whole width of registers is used. However, some operations become 
much more complex: table lookups must be replaced by equivalent circuitry, 
which means BDD (Binary Decision Diagrams); additions require manual carry 
propagation; multiplications are definitely out of the question. 

Moreover, the algorithm must be representable as a circuit; this is anyway a 
desirable characteristic for block ciphers, since data-dependent branches lead to 
timing attacks [8]. 

To sum up, some cryptosystems are well suited to bitslice, others are 
not. DES can be implemented very efficiently this way; it was done for 
DESchall [9] (a software-based DES cracking challenge by exhaustive search of 
the key space). Serpent [10], an AES candidate, was also designed to be 
implemented using these techniques. We present in this paper a new algorithm, 
called FBC (for “Fast Bitslice Cipher”), which is optimized for speed under a 
bitslice implementation. 



3 The FBC Algorithm 

3.1 General Structure 

The FBC algorithm is an r-round Feistel cipher; it works on w-bit values (w is 
even). The confusion function is simple: each output bit is the bitwise combination 
of two different input bits. Four combinations are used: AND, OR, NAND and 
NOR. Which combinations are used on which bits is key-dependent and round-dependent. 
Figure 2 illustrates this setup. Due to practical implementation 
issues, w must not exceed 512. 

The three main ideas are: 

— use a simple, fast round function with many rounds (for instance, r = 64); 

— derive each round subkey from the key using a cryptographically strong 
pseudo-random generator (so that attacks on subkeys cannot be proven to 
exist, neither can they compromise other subkeys); 

— make code generation part of the key schedule; this is slow but allows for 
many key-dependent features. 

In a more formal way: for each binary value, we count bits from left to right; 
the leftmost bit is numbered 1. For round i, the input is split into two equal parts 
of size w/2: the left part L_i and the right part R_i. There exist w/2 functions 
τ_j^i and two permutations φ_i and ψ_i of w/2 elements such that: 

$\forall j,\ \tau_j^i \in \{\mathrm{AND}, \mathrm{OR}, \mathrm{NAND}, \mathrm{NOR}\}$ 

$\forall j,\ \phi_i(j) \neq \psi_i(j)$ 

The result T_i of the confusion function is such that bit j of T_i (1 ≤ j ≤ 
w/2) is equal to τ_j^i applied to bits φ_i(j) and ψ_i(j) of R_i. The output of the 
round is the concatenation of L'_i and R'_i, in that order, where: 

$L'_i = R_i$ 

$R'_i = L_i \oplus T_i$ 

After the last round, the left and right parts of the result are swapped; this is 
done so that the decryption algorithm is exactly the same as the encryption 
algorithm, except for the definition of the φ_i, ψ_i and τ_j^i functions. 
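The round structure can be sketched at the bit level in Python. This is a plain, non-bitsliced illustration under stated assumptions: bits are held in lists and indexed from 0 rather than 1, each round is a hypothetical tuple (phi, psi, tau), and decryption is modelled by running the same routine with the rounds in reverse order.

```python
GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR":  lambda a, b: 1 - (a | b),
}


def confusion(R, phi, psi, tau):
    """Bit j of T is gate tau[j] applied to bits phi[j] and psi[j] of R."""
    return [GATES[tau[j]](R[phi[j]], R[psi[j]]) for j in range(len(R))]


def fbc_encrypt(block, rounds):
    """Apply the Feistel rounds, then swap the two halves."""
    half = len(block) // 2
    L, R = block[:half], block[half:]
    for phi, psi, tau in rounds:
        T = confusion(R, phi, psi, tau)
        L, R = R, [l ^ t for l, t in zip(L, T)]   # L' = R, R' = L xor T
    return R + L                                  # final swap


def fbc_decrypt(block, rounds):
    """Same algorithm with the round definitions taken in reverse order."""
    return fbc_encrypt(block, rounds[::-1])


# Toy instance: w = 8, r = 2, hand-picked round elements with phi[j] != psi[j].
rounds = [
    ([0, 1, 2, 3], [1, 2, 3, 0], ["AND", "OR", "NAND", "NOR"]),
    ([3, 2, 1, 0], [0, 3, 2, 1], ["NOR", "AND", "OR", "NAND"]),
]
block = [1, 0, 1, 1, 0, 0, 1, 0]
print(fbc_decrypt(fbc_encrypt(block, rounds), rounds) == block)
```

As in any Feistel network, the confusion function never needs to be inverted, so non-invertible gates such as AND and NOR are acceptable.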

To complete this scheme, we must specify how those functions are chosen 
from the master key. We use the master key as a seed for a pseudo-random 
generator, that uses a cryptographically strong hash function. 

3.2 The Pseudo-random Generator 

The secret key is a k-bit value, where k ranges from 0 to 352. As usual, if k is 
lower than 80, the scheme is to be considered as vulnerable to exhaustive key 
search attacks. The AES defines key sizes of 128, 192 and 256 bits; FBC allows 
these, and other sizes as well, so any useful security level can be achieved with 
FBC. 

We use the hash function SHA-1 as defined in [11]. This specification defines 
the application of SHA-1 on an arbitrary bit stream, and includes a padding 
method to extend the bit stream to a size multiple of 512. We do not use that 
padding, so “SHA” is to be considered in this paper as “application of the SHA-1 
core function to an unpadded block of 512 bits”. 

The key K is extended to the 352-bit value K' by appending zeroes to its 
right. S is a 160-bit variable which will contain the “state” of the generator. The 
algorithm is the following: 

— 1. S ← 0 

— 2. S ← SHA(K' || S) 

— 3. The 20 bytes of S are emitted (leftmost first) 

— 4. Go back to step 2 

(|| denotes concatenation). 

Therefore the generator emits bytes. These bytes will be used to choose 
random numbers between 0 and w/2 - 1, which is exactly why w must be at 
most 512; otherwise, the definition of the key schedule should be adapted. 

We will have to choose integers ranging from 0 to some limit n, where n is a 
positive number strictly smaller than w/2. We calculate m, the greatest positive 
multiple of n + 1 that is smaller than or equal to 256; for instance, if n = 6, we have 
m = 252. 

To choose a random number from 0 to n, we get one byte b from the random 
generator; if this byte is greater than or equal to m, we get another byte, until we 
have a value strictly smaller than m. It is easy to see that the average number of 
invocations of the random generator is at most 2, so this process is not especially 
slow. The random number is defined to be the remainder of the Euclidean division of b 
by n + 1. This process ensures that all integers between 0 and n have an equal 
probability of appearing. 
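The rejection step can be written compactly. In this sketch the byte source is any Python iterator of integers in 0..255 (an assumption made for illustration; in FBC it would be the keyed generator), and the function names are hypothetical.

```python
def rejection_bound(n):
    """Greatest multiple of n + 1 that is at most 256."""
    return (256 // (n + 1)) * (n + 1)


def uniform_from_bytes(byte_source, n):
    """Return an integer uniform on 0..n (n < 256) drawn from an iterator of bytes."""
    m = rejection_bound(n)
    while True:
        b = next(byte_source)
        if b < m:                 # bytes >= m would bias the result, so skip them
            return b % (n + 1)


print(rejection_bound(6))                                # the paper's example, n = 6
print(uniform_from_bytes(iter([255, 252, 13]), 6))       # two rejections, then 13 mod 7
```

Since m always exceeds 128, more than half of all byte values are accepted, which is where the bound of at most 2 draws on average comes from.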



3.3 Choice of a Random Permutation 

We must choose random permutations of w/2 elements; we will use the following 
algorithm: 

— 1. Fill an array p of w/2 elements with the numbers from 1 to w/2 in ascending 
order (∀i, p[i] = i); this array represents the identity permutation. 

— 2. For i ranging from 2 to w/2 

— 3. Choose a random integer a_i between 0 and i - 1 

— 4. If a_i + 1 ≠ i, swap the contents of p[a_i + 1] and p[i] (this is equivalent to 
the composition of p with the transposition (i (a_i + 1))) 

— 5. End for 

The chosen permutation is represented by the contents of p at the end of 
the execution of the algorithm (p[5] = 8 means that the permutation sends the 
fifth element of its input to the eighth position of its output). This algorithm 
ensures that all permutations of w/2 elements have an identical probability of 
being chosen [12]. 
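The shuffle above translates directly into Python. In this sketch the random integer of step 3 is supplied by a caller-provided function `choose(i)`, which is an assumption made for illustration; in FBC it would be drawn from the keyed generator, and here Python's `random` module stands in. The list is 0-based, so `p[i - 1]` holds the paper's p[i].

```python
import random


def random_permutation(size, choose):
    """Build a permutation of 1..size; choose(i) returns an integer in 0..i-1."""
    p = list(range(1, size + 1))          # identity permutation
    for i in range(2, size + 1):
        a = choose(i)                     # the a_i of step 3
        if a + 1 != i:
            p[a], p[i - 1] = p[i - 1], p[a]   # swap p[a_i + 1] and p[i]
    return p


rng = random.Random(2001)                 # stand-in for the keyed generator
print(random_permutation(16, rng.randrange))
```

Choosing a_i = i - 1 at every step leaves the identity untouched, while choosing a_i = 0 every time rotates the array, which makes both cases convenient for testing.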

3.4 Choice of the Elements of Each Round 

To choose the elements constituting round i (the two permutations φ_i and 
ψ_i, and the w/2 boolean functions τ_j^i), we proceed this way: 

— 1. Choose φ_i randomly. 

— 2. Choose ψ_i randomly. 

— 3. If there exists at least one j between 1 and w/2 such that φ_i(j) = ψ_i(j), 
go back to 2. 

— 4. For each j from 1 to w/2, get one random byte; the remainder of the Euclidean 
division of that byte by 4 is a value between 0 and 3. The function τ_j^i will 
be an AND, OR, NAND or NOR for values of, respectively, 0, 1, 2 and 3. 

The elements of each round are chosen from the first round to the last. On 
average, for each φ_i, we will have to try e ≈ 2.7 permutations ψ_i before finding 
one matching the criterion of point 3 (finding a matching ψ_i is equivalent to 
finding ψ_i ∘ φ_i^{-1}, a permutation with no fixed point; see [13] for the proportion of 
such permutations among all permutations of w/2 elements). 
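The selection procedure and the estimate of about e attempts can be checked empirically. The sketch below is not the FBC key schedule: it uses Python's `random` module in place of the keyed generator, 0-based positions, and a hypothetical function name `choose_round`; it also returns the number of attempts so that the average can be measured.

```python
import random

GATE_NAMES = ("AND", "OR", "NAND", "NOR")


def choose_round(half, rng):
    """Pick phi, psi with phi[j] != psi[j] for all j, and one gate per position."""
    phi = rng.sample(range(half), half)               # step 1: random permutation
    tries = 0
    while True:
        tries += 1
        psi = rng.sample(range(half), half)           # step 2: random permutation
        if all(phi[j] != psi[j] for j in range(half)):
            break                                     # step 3 satisfied
    # Step 4: one random byte per position, reduced modulo 4.
    tau = [GATE_NAMES[rng.randrange(256) % 4] for _ in range(half)]
    return phi, psi, tau, tries


rng = random.Random(2001)
trials = 2000
mean_tries = sum(choose_round(32, rng)[3] for _ in range(trials)) / trials
print(mean_tries)   # expected to be close to e, about 2.7
```

Because 256 is a multiple of 4, reducing a byte modulo 4 introduces no bias, so no rejection step is needed for the gate choice.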



4 Implementation of FBC 

4.1 Software Implementation 

FBC is designed to be implemented in software, using bitslicing techniques. Such 
code is rather difficult to write, but we developed an automatic tool 
to produce bitsliced C code from an ad hoc description of the algorithm, which 
can be generated from the key schedule algorithm. That tool is not very well 
developed but is available for free download and use (see [14]). The C code 
generated for a fully deployed 64-round FBC is a huge function with about 5000 
local variables and 5000 statements; the C compiler fails utterly on such input, 
so the code must be sliced into small groups of four rounds or so, which are easier 
for the compiler to handle. 

We ignore the cost of orthogonalization of data before and after 
encryption; actually, not performing such orthogonalization is equivalent to 
performing a known, fixed permutation on input and output data blocks (the 
512-byte blocks if we perform 64 parallel encryptions with a 64-bit block size). 
Such a permutation has no security implication, so we can add that permutation, 
which actually removes the cost of orthogonalization. 

Encryption bandwidth achieved for the moment on an Alpha 21164 processor 
running at 500 MHz is about 32 Mbytes/s using FBC with 64-bit words and 
64 rounds (w = 64, r = 64); the code is not yet fully optimized and we are still 
working on it. This speed is half the speed requested (we wanted 120 Mbytes/s 
on a 1 GHz processor) but is explainable by the relatively old design of the 21164 
(that processor was first announced in August 1994, which is a very remote epoch 
in the rapidly moving chip industry). The 21164 can issue up to four instructions 
in each cycle, but only two load/store instructions and two logical instructions. 
Moreover, a load instruction cannot occur in the second cycle following a 
store instruction. This restriction, tied to the aging design of the 21164, is 
responsible for the seemingly bad performance of FBC on that processor; the 
newer 21264 does not have that problem. 

The number of logical $\tau$ functions to calculate for one encryption is $wr/2$. If 
n is the size of each register, the bitslice code will calculate n parallel instances 
of the algorithm, thus encrypting wn bits. The number of function evaluations 
needed for each data bit is then $r/(2n)$. Since the wanted rate is about one 
bit per clock cycle, we must execute $r/(2n)$ $\tau$ functions per cycle (it is worth noticing 
that this value does not depend upon the block size used). 

Bitsliced code uses many registers, far more than the processor actually 
provides; the registers must therefore be considered as a cache for values 
stored on the stack. Data management thus remains the 
most constraining issue. Each $\tau$ function requires two input operands and one 
output operand; since each input bit is used twice, in two different $\tau$ functions, 
the number of memory operations needed can be reduced to one 
load and one store for each function. Due to the restrictions on such operations, 
an average of 3 cycles is needed per function. Together with the unavoidable additional 
cost of data transfer (an administrative task outside the core of the cipher), 
this explains the “low” rate achieved on the 21164 (that rate is equal to the best 
rates achieved by ciphers such as the AES on the same machine). 

However, the current market-leading Alpha processor is the 21264, which has 
far fewer restrictions on memory accesses; from its specifications, it should 
achieve the target performance (one cycle per bit enciphered), whereas classical 
cryptosystems, which use more complex structures of the processor, will not 
benefit as much from the generation shift (speed measurements [15] from the AES 
competition show that the fastest candidates would run at 2 clock cycles per bit 
enciphered on a 21264-equivalent processor). Optimization of the code for the 
21264 architecture is still in progress. 

At the very least, “64 rounds” is a conservative number from a security point 
of view. That number could be lowered, and the speed of FBC would rise 
correspondingly. 

4.2 Hardware Implementation 

FBC is well-suited for FPGA (“Field Programmable Gate Array”) implementations. 
FPGAs are programmable chips, which can host any circuit, and can be 
reprogrammed in little time (less than a second) with no loss. 

For an FPGA implementation, the key schedule algorithm would produce a 
circuit design, to be loaded into the FPGA. Bitslicing (which is parallelization 
in space) becomes pipelining (parallelization in time), so a fully deployed FBC 
should run at one block per cycle; since each round is very simple (only one 
logic gate layer per round), a very fast clocking rate could be achieved. Since 
recent Xilinx FPGA [16] chips may run at rates over 150 MHz, an instance of 
FBC with 64-bit words on such a chip would encipher over 10 Gbits of data per 
second; this matches the fastest data rate achieved by the best production optical fibers. 

If speed is at stake, specific hardware is usable, and key schedule time is 
unimportant, then FBC is the way to go. 

5 Security of FBC 

Security of FBC is based upon the following paradigms: 

— many rounds, 

— unpredictable random subkeys, 

— key-dependent permutations and non-linear functions. 



5.1 Many Rounds 

It has been said for a long time that “take whatever round function you want, 
it will be secure if you put up enough rounds”. This assertion used to be a joke, 
but it actually makes much sense. 

Modern cryptanalytic attacks, such as differential and linear cryptanalysis, 
tend to have a complexity exponential in the number of rounds; in particular, if the 
probabilistic advantage of the attacker is 1/2 on one round, then 64 rounds will 
lower that advantage to $2^{-64}$, a quite appropriate number for an FBC operating 
on 64-bit blocks. 

5.2 Unpredictable Random Subkeys 

Most cryptanalytic attacks use the fact that information on the key used in 
one round somehow shows up in a predictable way in some other rounds. Thus 
FBC produces all key-dependent round material with a cryptographically strong 
pseudo-random generator, seeded by the master key. If any information, learned 
or guessed, on the subkey of some rounds could be applied to another round by the 
attacker, then this would contradict the strength of the generator. Besides, even 
if all key-dependent material is guessed, thus giving some strong knowledge 
about the output of the generator, it would still be computationally infeasible 
to guess the master key; thus, a successful attack on an FBC-encrypted link with 
a simple daily key-updating policy would be limited to one day of decryption. 

5.3 Key-Dependent Permutations and Non-linear Functions 

In FBC, the permutations are made key-dependent as an attempt to make the 
avalanche effect unanalyzable by the attacker. One consequence is that 
permutations cannot be guaranteed to be “strong”. We did some sample measurements of 
the avalanche effect; here is the average number of data bits potentially modified 
by one bit after several rounds, for w = 64: 



rounds                        1     2      3      4      5      6      7      8      9     10 
bits potentially modified  2.00  4.91  11.05  22.81  39.54  54.76  62.34  63.93  63.99  64.00 



This means that full avalanche is obtained in 10 rounds, to be 
compared with the suggested value of 64 rounds. 

The $\tau$ functions are dependent on the key, and chosen among the four 
functions AND, OR, NAND and NOR, which are the four symmetric non-linear boolean 
functions. Given random inputs and random functions in this set, the output is 
well-balanced and statistically not correlated to the input. 
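
The balance claim can be checked by exhaustive enumeration. The short Python sketch below (all names are chosen for the illustration) verifies that each of the four gates is symmetric and non-affine, and that a uniformly chosen gate applied to uniform inputs outputs 1 exactly half of the time, whether or not the first input is known.

```python
from itertools import product

GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "NOR": lambda a, b: 1 - (a | b),
}
PAIRS = list(product((0, 1), repeat=2))


def is_symmetric(f):
    return all(f(a, b) == f(b, a) for a, b in PAIRS)


def is_affine(f):
    # A two-input function is affine iff the XOR of its four outputs is 0
    # (that XOR is the coefficient of the a*b term in its algebraic form).
    return f(0, 0) ^ f(0, 1) ^ f(1, 0) ^ f(1, 1) == 0


# Number of 1 outputs over all 4 gates x 4 input pairs (16 cases).
ones_total = sum(f(a, b) for f in GATES.values() for a, b in PAIRS)

# Same count with the first input fixed (4 gates x 2 values of b = 8 cases).
ones_given_a = {
    a: sum(f(a, b) for f in GATES.values() for b in (0, 1)) for a in (0, 1)
}
```

Since the gates are symmetric, the same conditional count holds when the second input is fixed instead.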



5.4 Security Sum Up 

The FBC design looks quite secure, with the rule of thumb r = w, which means 
“as many rounds as bits in the block size”. This is a conservative estimate, 
based upon the assumption that if the block size is 64 bits, then the scheme 
should be secure against attacks using $2^{64}$ adaptively chosen plaintexts, which 
is far from being applicable in real life. Typically, if the enciphered text is a 
harddisk, the maximum amount of ciphertext provided is about 2^^ blocks or 
so. Yet, security margins should not be too much fiddled with. 



6 Harddisk Encryption 

The problem of harddisk encryption is complex, and depends upon the type of 
attack considered. We will consider passive and active attacks, and detail the 
threat model and several corresponding solutions. 



6.1 Passive Attacks 

The model is the following: a computer stores confidential data on its harddisk; 
the computer or its disk might be stolen while it is not powered, therefore the 
data on the disk must be stored in encrypted form only. The attacker is assumed to 
be able to guess most of the encrypted plaintext, and must not learn anything 
about the remaining plaintext. 

Several products already address, or try to address, this security issue. Some 
work on a file basis, masking the real names and internal contents of the files; 
this maps cleanly to network filesystem protocols such as NFS or Samba, which 
use file-oriented semantics. However, this leaks information, mainly the number 
of files created, their sizes and modification dates. Therefore the real security 
provided by those systems is only marginal. 

Other products build up real enciphered blocks of data, upon which standard 
filesystems are applied. Those products use classical block ciphers and induce a 
performance hit which leads users into reserving encryption for really important 
data only. This may help the attacker know what computers hold his target inside 
a large company; good security can therefore be achieved only if all harddisks 
are completely enciphered, which is possible only if the performance hit is very 
small. 

FBC is designed to encipher data in ECB mode, so that the parallelism 
given by the bitslice representation of data can be used. ECB mode has the 
following problem: input blocks are not randomized. Real-life data is often very 
redundant; equal blocks are enciphered identically, and the attacker is 
able to detect them. The countermeasure is to “add” a counter: each block 
is combined with its block number prior to encryption, with 
an addition or a bitwise XOR. The cost of this modification is negligible 
compared to the cost of encryption itself (on an Alpha, it costs one cycle for 
64 bits of data). 
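
The effect of the counter can be illustrated with a small Python sketch. FBC itself is not reproduced here: `toy_encrypt` is a hypothetical stand-in, a four-round Feistel permutation on 64-bit blocks built from SHA-256, used only so that the example is self-contained. The XOR variant of the countermeasure is shown.

```python
import hashlib

MASK32 = 0xFFFFFFFF


def _round(key, i, half):
    # Round function of the toy cipher (an assumption, unrelated to FBC).
    digest = hashlib.sha256(key + bytes([i]) + half.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big")


def toy_encrypt(key, block):
    """Toy 64-bit block cipher standing in for FBC (illustration only)."""
    left, right = block >> 32, block & MASK32
    for i in range(4):
        left, right = right, left ^ _round(key, i, right)
    return (left << 32) | right


def encrypt_blocks(key, first_block_number, blocks):
    """ECB encryption where each block is first XORed with its block number."""
    return [
        toy_encrypt(key, p ^ (first_block_number + i))
        for i, p in enumerate(blocks)
    ]


key = b"illustrative key"
sector = [0] * 3                      # three identical plaintext blocks
plain_ecb = [toy_encrypt(key, p) for p in sector]
with_counter = encrypt_blocks(key, 100, sector)
```

With plain ECB the three ciphertext blocks are identical; with the counter they are pairwise distinct, because the cipher is a permutation and its three inputs now differ.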

One FBC issue is that the key schedule implies the generation of code, a 
rather slow process (it can take up to several minutes) that relies on a large 
amount of software (a C compiler is usually not a small application). This can be 
addressed in the following way: the result of the key schedule, that is, the code that encrypts 
and decrypts, is stored encrypted on the disk, using some other block cipher, the 
AES for instance. The decryption is done only once at boot time, so there is no 
real performance issue here. The key used to decipher the FBC code need not 
be the same key as the FBC one. 



6.2 Active Modifying Attacks 

We consider here the following model: the computer is stolen while being unpow- 
ered, its contents are modified, and the computer is put back in place before the 
theft is noticed. A random and destructive modification cannot be prevented, 
but we want to be sure that it would not go unnoticed. No existing product 
actually addresses this issue. 

The classical solution is to store a MAC, which is easily built up with a hash 
function: the entire encrypted disk content, appended to some secret key, is 
processed through the function, and the result is written to some non-encrypted 
area of the disk (one such area must exist, to store the base decrypting software, 
that asks for the user key). The major drawbacks of this approach are: 

— The speed of the process is limited to the speed of the disk, so it can take 
an impressive amount of time (one hour on today’s typical disk); this has to 
be performed at boot time, and no work can take place during this check. 
This is not acceptable from the user’s point of view. 

— More critically, the MAC must be calculated again at shutdown time. This is 
even harder to impose on the user, especially since some shutdowns 
are due to an OS crash or a low battery. 

— Even if the disk is logically almost empty, its whole content must be processed. 
An alternative is to build the MAC only on the blocks that contain 
allocated data, but this still takes a considerable amount of time (today’s basic 
operating system installation uses several hundred megabytes of disk space, 
not including oversized applications). 

Here is one solution addressing these problems: each block is double-encrypted 
in the following manner: if block number N contains P, and E is the 
encryption function, then its encrypted counterpart is $C = E(E(P) \oplus N)$. For 
each file, the exclusive or of all its constituent blocks is stored inside a 
file-specific structure that is also encrypted (on some systems*, such a structure 
exists and is called an inode). This encrypted XOR is the MAC. 

Using this scheme, the verification of files can be made asynchronously, as a 
background task; only a locking procedure must be used so that an individual 
file may not be used prior to its verification. More importantly, the MAC of each 
file is maintained during normal operation; this means that a modification of a 
file requires the reading of either the overwritten data (so that its contribution 
to the file MAC can be taken away) or the remaining data (that is, the 
recomputation of the MAC). This is not a significant cost, because, most of the time, 
modifications of files are either appending data to the end of the file, or emptying 
the file and rewriting it from scratch. In any case, when modifying a file, the size of 
the extra reads is limited to half the size of the file. 
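
The incremental maintenance described above can be sketched in a few lines of Python. The helper names are hypothetical, blocks are modeled as integers, and the final encryption of the XOR value (which turns it into the MAC proper) is left out.

```python
from functools import reduce
from operator import xor


def file_xor(blocks):
    """XOR of all blocks of a file (full recomputation)."""
    return reduce(xor, blocks, 0)


def update_xor(old_xor, old_block, new_block):
    """Remove the overwritten block's contribution, then add the new one."""
    return old_xor ^ old_block ^ new_block


blocks = [0x10, 0x22, 0x07]
running = file_xor(blocks)

# Overwrite block 1: only the old content of that block has to be read.
new_value = 0x5A
running = update_xor(running, blocks[1], new_value)
blocks[1] = new_value
```

Appending a block is the special case where the old block is zero, which is why appends and full rewrites need no extra reads.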

The XOR with the block number between the two encryptions ensures that 
the data blocks are not swappable by the attacker; this operation is isolated from 
the plaintext and from the ciphertext by the two encryptions. The assumption 
is that the block cipher is a random permutation, therefore any modification to 
the ciphertext leads to a random, uncontrollable modification of the plaintext. 
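
The following Python sketch illustrates why blocks cannot be swapped under the double encryption $C = E(E(P) \oplus N)$. The cipher `E`/`D` is a hypothetical toy Feistel permutation built from SHA-256, not FBC, and the integrity value is taken to be the XOR of the plaintext blocks, which is one reading of the scheme; its own encryption is omitted.

```python
import hashlib

MASK32 = 0xFFFFFFFF
ROUNDS = 4


def _f(key, i, half):
    # Round function of the toy cipher (an assumption, unrelated to FBC).
    d = hashlib.sha256(key + bytes([i]) + half.to_bytes(4, "big")).digest()
    return int.from_bytes(d[:4], "big")


def E(key, block):
    left, right = block >> 32, block & MASK32
    for i in range(ROUNDS):
        left, right = right, left ^ _f(key, i, right)
    return (left << 32) | right


def D(key, block):
    left, right = block >> 32, block & MASK32
    for i in reversed(range(ROUNDS)):
        left, right = right ^ _f(key, i, left), left
    return (left << 32) | right


def store(key, n, p):
    """C = E(E(P) xor N) for block number n."""
    return E(key, E(key, p) ^ n)


def load(key, n, c):
    """Inverse of store for block number n."""
    return D(key, D(key, c) ^ n)


key = b"illustrative key"
plain = {7: 0x0123456789ABCDEF, 8: 0x0F1E2D3C4B5A6978}
disk = {n: store(key, n, p) for n, p in plain.items()}
expected_xor = plain[7] ^ plain[8]

# The attacker swaps the two ciphertext blocks on disk.
tampered = {7: disk[8], 8: disk[7]}
recovered = {n: load(key, n, c) for n, c in tampered.items()}
swap_detected = (recovered[7] ^ recovered[8]) != expected_xor
```

Because the block number is mixed in between the two encryptions, the block moved to position 7 cannot decrypt to the plaintext of block 8; it decrypts to an unrelated value, and the XOR check fails except with negligible probability.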

The main drawback of this scheme is the double encryption, which halves 
performance. Therefore the use of a very fast cipher is critical for such a design. 

7 Conclusion 

We presented a new cipher, FBC, designed to achieve high encryption speed on 
a modern workstation, adapted to on-the-fly harddisk encryption. We presented 
some arguments regarding its security, and discussed some implementation 
issues. We also discussed some issues related to the problem of harddisk 
encryption and presented a generic scheme to ensure data integrity at low cost. 

We believe that such work will be used more and more in the future, as mobile 
computing is spreading at a fast pace. As a side note, OpenBSD [17] (a Unix-like 
system specialized in security) already includes an encryption mechanism for 
its swap space. An open question (which does not apply to swap space, since the 
contents of swap space are not used across reboots) is the possibility of building 
a secure integrity verification scheme that requires neither a complex shutdown 
procedure, nor double encryption, nor too many extra reads when a file is 
modified. 

* Actually, all Unix-like systems, including MacOS and Windows NT/2000. 



Abstract. Sliding Windows is a general technique for obtaining an efficient 
exponentiation scheme. Big Mac is a specific form of attack on a 
cryptosystem in which bits of a secret key can be deduced independently, 
or almost so, of the others. Here such an attack on an implementation 
of the RSA cryptosystem is described. It assumes digit-by-digit computations 
are performed sequentially on a single k-bit multiplier and uses 
information which leaks through differential power analysis (DPA). With 
sufficiently powerful monitoring equipment, only a small number of 
exponentiations, independent of the key length, is enough to reveal the secret 
exponent from unknown plaintext inputs. Since the technique may work 
for a single exponentiation, many blinding techniques currently under 
consideration may be rendered useless. This is particularly relevant to 
implementations with single processors where a digit multiplication cannot 
be masked by other simultaneous processing. Moreover, the longer 
the key length, the easier the attack becomes. 

Key words: Cryptography, RSA, differential power analysis, blinding, 
DPA, smart card, exponentiation, sliding windows. 



1 Introduction 

Timing analysis and differential power analysis (DPA) techniques [8], [9], [2], [1] 
show that RSA cryptosystems [13] suffer from implementation weaknesses rather 
than lack of algorithmic strength. The secret signing or decryption exponent 
d often seems easy to recover from a smart card or other dedicated embedded 
system using DPA [8], [1], [3], [10], [4]. These attacks start by averaging a number 
of power traces in order to remove dependencies other than the quantity being 
sought and to reduce the effect of random noise. For the card described in [10] 
which uses the standard square and multiply algorithm for exponentiation, this 
immediately reveals the exponent because of the different shape of power traces 
for squarings and multiplications. 

The power-related property on which the current attack is based depends on 
the fact that switching a gate consumes more power than not doing so. Generally, 
these tiny effects are submerged in too many other data dependent variations 
to be easily extracted. However, here we develop a novel way of combining sections 
of power traces which enhances the effect into a potentially very powerful 
technique. We show that different multiplicands can be distinguished. As a result, 
the so-called m-ary and sliding windows methods of exponentiation [6], [7] 
become vulnerable as well as the square-and-multiply method. 

A generally touted solution to this problem is to use different exponents with 
a randomly generated component on each decryption. In particular, Kocher ([8], 
§10) suggests using $d + r\phi(M)$ as the decryption key instead of d, where M is the 
modulus and r is a random number generated anew for each decryption. This 
blinding certainly hides the exponent if averaging over a number of different 
decryptions has to be performed in order to reduce noise to levels at which the 
data dependencies are revealed. However, our simulations suggest that combining 
different sections of the power trace for just a single exponentiation may be 
sufficient to reveal the exponent, thereby negating the value of this type of 
blinding. Without such blinding, the technique certainly reduces the sample set 
that needs to be considered for DPA to be successful and implies that some sort 
of blinding should be a requirement in relevant cryptographic standards. 
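
As a reminder of how this exponent blinding works, here is a minimal Python sketch with textbook-sized parameters. The primes, the 16-bit size of r and the function name are assumptions made for the illustration; real moduli are far larger. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import secrets

# Toy RSA parameters (far too small for real use).
P, Q = 61, 53
M = P * Q                   # modulus
PHI = (P - 1) * (Q - 1)     # phi(M)
E_PUB = 17                  # public exponent e
D = pow(E_PUB, -1, PHI)     # secret exponent d


def blinded_decrypt(c):
    """Decrypt with exponent d + r*phi(M), r drawn anew on each call."""
    r = secrets.randbelow(1 << 16) + 1
    return pow(c, D + r * PHI, M)


message = 65
ciphertext = pow(message, E_PUB, M)
recovered = blinded_decrypt(ciphertext)
```

Every call uses a different exponent, so traces from different decryptions cannot be averaged to isolate d; the attack in this paper sidesteps that by working within a single exponentiation.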

A Big Mac Attack on a secret key d is a method which enables d to be revealed 
bit by bit by nibbling at sections of d in any order. The implied independence 
of the derivation of different bits means that the total data and processing time 
required are only linear in the key length. This contrasts strongly with the math- 
ematical strength of RSA, which is believed to be exponential in the key length. 
A well known brand product is so generously large as to be impossible to have 
a bite taken out of the whole at one go — like the method of attack, it must be 
nibbled at and consumed by tackling individual layers one by one in any order. 
Using DPA or other source of side-channel leakage, a similar arbitrary order 
of considering bits can eventually reveal the whole key, as we demonstrate. An 
example of another such attack, using timing information, was given in [16]. 

The context in which the attack may be mounted is a typical one for small 
embedded systems such as smart cards. We just require that a single k-bit multiplier 
be used to perform the RSA exponentiations in a digit sequential fashion, 
preferably with no other concurrent processing in progress. 



2 Notation 

An RSA cryptosystem (resp. signature scheme) over the integers [13] consists of 
a modulus M = PQ, which is the product of two large primes, and two keys 
d and e satisfying $A^{de} \equiv A \pmod{M}$. Message blocks A satisfying 0 < A < M 
are encrypted (resp. verified) with $C = A^e \bmod M$ and decrypted (resp. signed) 
using $A = C^d \bmod M$. The key e is generally chosen small with few non-zero 
bits (e.g. a Fermat prime, such as 3 or 17) so that encryption is relatively fast. 
The key d must be picked to satisfy $de \equiv 1 \pmod{\phi(M)}$ and therefore it usually 
has length comparable to M. The owner of the cryptosystem publishes M and 
e but keeps secret the factorization of M and the key d. Breaking the system 
means discovering d and is equivalent to factoring M, which is computationally 
infeasible for the size of primes used. 

The computation of these modular powers is characterised by two main processes: 
modular multiplication and exponentiation. Our main assumption is that the 
implementation has a k-bit architecture and uses a single k×k-bit multiplier to 
compute modular products (A×B) mod M. So, except for the exponents d and 
e, each number X has a representation of the form $X = \sum_{i=0}^{n-1} x_i r^i$ where $r = 2^k$ 
is the radix or base of the representation, the coefficients $x_i$ are its digits, and n 
is the number of digits required. The precise form or range of these digits is not 
important but we will see later that the larger n is, or the smaller k is, the more 
likely the attack is to succeed. The method is easily adapted to cases where the 
digit multiplier is not square. 

2.1 Exponentiation 

Exponentiation is often performed using the m-ary method [6] for which the 
exponent uses a representation with base m (here assumed to be a power of 2): 
$d = \sum_{i=0}^{t-1} d_i m^i$. The powers $C^i \bmod M$ (i = 1, 2, ..., m-1) are pre-computed and 
allocated to table entries $C^{(i)}$. Then a partial product is repeatedly raised to the 
power m by squaring and the pre-computed power of C corresponding to the 
next digit of d multiplied in: 

The m-ary (Modular) Exponentiation Algorithm 

{ Pre-condition: d = sum_{i=0}^{t-1} d_i m^i } 
C^(1) := C ; 
For i := 2 to m-1 do 
    C^(i) := C^(i-1) × C mod M ; 
P := C^(d_{t-1}) ; 
For i := t-2 downto 0 do 
Begin 
    P := P^m mod M ; 
    If d_i ≠ 0 then P := P × C^(d_i) mod M ; 
End ; 
{ Post-condition: P = C^d mod M } 
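
The algorithm above can be sketched in Python as follows. The function name and the returned list of operations are additions for the illustration: the list records, for each long integer operation, whether it is a squaring ("S") or a multiplication by table entry i ("Mi"), which is exactly the information the attack aims to recover. The sketch assumes d > 0 and m a power of 2.

```python
def mary_exp(C, d, M, m=4):
    """Return (C**d mod M, list of operations) using the m-ary method."""
    bits = m.bit_length() - 1            # log2(m) squarings per digit
    digits = []                          # base-m digits of d, least significant first
    while d:
        digits.append(d & (m - 1))
        d >>= bits

    # table[i] = C^i mod M for i = 1 .. m-1 (the pre-computations)
    table = [None, C % M]
    for i in range(2, m):
        table.append(table[i - 1] * C % M)

    ops = []
    P = table[digits[-1]]                # top digit is non-zero
    for di in reversed(digits[:-1]):
        for _ in range(bits):            # P := P^m by repeated squaring
            P = P * P % M
            ops.append("S")
        if di != 0:
            P = P * table[di] % M
            ops.append(f"M{di}")
    return P, ops


# d = 57 has base-4 digits 3, 2, 1 (most significant first).
value, ops = mary_exp(5, 57, 3233, m=4)
```

Knowing which table entry enters each multiplication, together with the squarings, determines every digit of d below the top one.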

The sliding window technique [7] is a straightforward generalisation of this 
which makes more efficient use of the presence of zero bits in the exponent. 
It employs a mixed basis representation of the exponent, using powers of 2 
and m. Only the odd powers need to be pre-computed and stored. The 
attack described here applies identically to this technique apart from the obvious 
modifications as a result of slightly different pre-computations, so it suffices to 
illustrate the ideas using the m-ary method. 

Hardware power consumption depends critically on bus movement involved 
in low level operations such as fetching instructions, reading from and writing 
to memory, etc. Since the long integer multiplications take a large number of 
cycles to perform and a large number of consecutive multiplications are executed, 
attackers are usually able to establish correctly the boundaries in the power 
traces between the operations in the algorithm above. 

2.2 Modular Multiplication 

Each long integer multiplication or squaring consists of a large number of indi- 
vidual digit-by-digit multiplications. Normally the modular reductions are inter- 
leaved within the iterations of the multiplication: 

Classical Modular Multiplication Algorithm: 

{ Pre-condition: A = sum_{i=0}^{n-1} a_i r^i } 
R := 0 ; 
For i := n-1 downto 0 do 
Begin 
    R := r×R + a_i×B ; 
    q_i := R div M ; 
    R := R - q_i×M ; 
End ; 
{ Post-condition: R = (A×B) mod M } 
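
A direct Python transcription of this loop is given below as a sketch. The helper `digits_of` and the default digit size k = 8 are assumptions for the example; the function expects 0 ≤ A < M so that A fits in n digits.

```python
def digits_of(x, k, n):
    """Split x into n digits of k bits each, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(n)]


def classical_modmul(A, B, M, k=8):
    """Compute (A*B) mod M one digit of A at a time, reducing at each step."""
    r = 1 << k                              # radix r = 2^k
    n = max(1, -(-M.bit_length() // k))     # digits needed for values below M
    a = digits_of(A, k, n)
    R = 0
    for i in range(n - 1, -1, -1):
        R = r * R + a[i] * B                # shift partial result, add a_i x B
        q = R // M                          # quotient digit q_i
        R = R - q * M                       # interleaved modular reduction
    return R


product = classical_modmul(1234, 2345, 3233)
```

Each pass through the loop multiplies one digit a_i by every digit of B, and these digit products are the operations whose power traces are exploited later.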

Montgomery’s version of long integer modular multiplication [11] has a similar 
structure, just reversing the order of processing the digits $a_i$. 

Both the classical algorithm above and Montgomery’s version are usually 
implemented in a way which makes them behave identically as far as this attack 
is concerned. The main variation worth highlighting is that for each long integer 
multiplication of the exponentiation either input A or B may be chosen as the 
pre-computed power of the initial ciphertext C. For convenience, we assume this 
power of C is the first argument, namely A, in the above code. However, to avoid 
unnecessary movement of data, the hardware must usually choose the same order 
for every multiplication. Then it is easy for an attacker to try both possibilities 
and select the one which provides the expected correlations. 

3 Selecting & Averaging the Power Traces for Big Mac 

The attack requires side channel leakage which has a dependency on the data 
being processed by the multiplier. Apart from measuring power consumption 
of the whole chip [4], the methods of Gandolfi et al. [5] could be directed to 
measuring EMR from the multiplier itself. 

Assume that discrete sampling of the cryptographic device provides a power 
(or EMR) trace function $tr : \mathbb{Z} \to \mathbb{R}$ for the pre-computations and 
exponentiation for a single decryption or signing. The definition of tr outside this 
computation interval is irrelevant here. Suppose further that the regular sampling 
provides a non-zero number of values for every digit multiplication. The more 
frequent the sampling, the better the results obtained for this attack, especially 
if a number of measurements can be made during each clock cycle. Typically, 
the standard smart card clock runs at 3.57 MHz and the current is sampled 
at 200 MHz, yielding a ratio of nearly $2^6$ to 1. This current is recorded using 
one or two bytes per measurement. As far as possible, such sampling should be 
synchronised to take place at the same points of each clock cycle. 

Sommer [12] noted that certain points in the clock cycle have much greater 
value for determining data dependencies than others. Initially, as gates are 
switched along paths in the multiplier, the current will be higher and be depend- 
ent on the activity. However, at the end of a clock cycle the combinational logic 
should have stabilised, and it will have a much lower data dependent contribu- 
tion. We are only interested in points with data dependent power consumption. 
Assume that several such points have been identified in the clock cycle, and 
we are able to take a weighted average of them in the trace so that, as far as 
possible, the data dependent contribution to the power represents the number 
of gates being switched and any measurement errors are averaged out. All other 
points must be discarded from the trace, leaving only data dependent ones. 

The main loop of the long integer modular multiplication algorithm contains 
a repetition of k-bit multiply-accumulate digit operations of the form 

    $r_j + r \times carry := r_{j-1} + a_i \times b_j + carry$    (0 ≤ j < n) 

which take place in a single cycle. It is only the sub-traces for these operations 
that are used in the attack. The sections of the trace corresponding to these 
can be identified easily because, by using the multiplier, they differ substantially 
from sections corresponding to other operations. 

Suppose we have already distinguished squares from multiplies and wish to 
establish the value of the exponent digit, say $d_s$, associated with the sth long 
integer multiplication. Let $tr_{sij}$ denote the function obtained by setting tr to 
0 outside the sub-interval during which the attacker expects the digit product 
$a_i \times b_j$ to be computed within the sth multiplication, and then translating that 
subinterval to $[i\tau, (i+1)\tau - 1]$ where $\tau$ is the common number of sample points 
for each such digit-by-digit multiply-accumulate. (After deleting irrelevant points 
and averaging as necessary, we may well have reduced $\tau$ to 1.) 

Assuming, as stated, that A is the input which is a pre-computed power of 
C, define $tr_{si} = \frac{1}{n}\sum_{j=0}^{n-1} tr_{sij}$ to be the function given by averaging the $tr_{sij}$ 
over all j. So $tr_{si}$ depends on the single digit $a_i$ of A but all the digits $b_j$ and 
$r_j$ of essentially random numbers B and R, and some carries. This averaging 
should produce a function $tr_{si}$ for which the random variable associated with the 
value at any given point has contributions to the variance from its dependence 
on B and from random noise, both of which are only $\frac{1}{n}$ times those for $tr_{sij}$ 
and for equivalent positions in tr. Because the multiply-accumulate operation 
uses k times as much hardware in $a_i$- and $b_j$-dependent computations as for 
accumulating the carry and $r_{j-1}$ digits, the contributions from $r_{j-1}$ and the 
carry are certainly lower, perhaps by k times, than that from B. Hence the 
clearest correlation that $tr_{si}$ should exhibit will be with the value of $a_i$. 

This averaging of the traces over the digits of B replaces the usual DPA 
averaging of traces over a number of different exponentiations. On the reasonable 
assumption that B is sufficiently random and has a number of digits, the resulting 
average trace will then have little dependence on B. (If the pre-computed power 
is the B input, we sum over i instead of j to obtain a result which again depends 
on a single digit of the pre-computed power of C.) 

Lastly, define $tr_s : \mathbb{Z} \to \mathbb{R}$ by $tr_s = \sum_{i=0}^{n-1} tr_{si}$. As $tr_s$ is the concatenation of 
the non-zero sections of the $tr_{si}$, it has a non-zero definition on $[0, n\tau - 1]$ whose 
strongest data dependency is through the pre-computed argument A of the sth 
multiplication, i.e. the power of C corresponding to the exponent digit $d_s$. The 
obvious question to ask is whether this dependency is strong enough to identify 
$d_s$ since then the secret exponent d can be discovered. 



4 Simulation 

In order to investigate the feasibility of the attack, a simple k-bit multiplier was 
simulated. It was built mostly from standard 3-to-2 full adders with a carry 
propagator and had a variable size k. This was used to count gate switching in 
the combinational logic only, with no account being taken of changes in registers 
which might contribute to power use. 

Data-dependent power usage is immediately apparent when gate counts are 
partitioned into subsets according to the Hamming weight of the two inputs. 
There is a very clear increase in the number of gate switchings as the Hamming 
weight of either input is increased. Tables of these values displayed a difference of 
a little over k gate changes between adjacent cells in the centre of the table, where 
both Hamming weights are approximately k/2 and most input pairs are clus- 
tered. Except for extreme Hamming weights, the table entries were almost linear 
in each Hamming weight — sufficiently so to explain and justify the arguments 
below. Moreover, the results were essentially symmetrical, i.e. the same num- 
ber of gates were switched on average when the two inputs were interchanged. 
This occurred under several configurations even though no attempt was made 
to balance the number of gates switched. 

For a variety of values of k, modulus bit lengths and exponent bases m, a number 
of random sets of powers $\{C^{(1)}, \ldots, C^{(m-1)}\}$ were generated. These were 
used as input A of the modular multiplier and, to simulate the pre-computations, 
multiplied by a random long integer B to create a trace $tr_i$ associated with each 
$C^{(i)}$. The trace consisted of a vector of gate switch counts for each digit of $C^{(i)}$. 
These individual counts were the sum of the gate switch counts for each product 
of the digit of $C^{(i)}$ by a digit of B. The traces then corresponded to the power 
traces $tr_s$. With the component from B averaged, the trace $tr_i$ for each $C^{(i)}$ 
corresponded closely to the vector of true average gate switch counts for the digits 
of $C^{(i)}$. In particular, this meant the trace was reasonably characteristic of $C^{(i)}$ 
and its elements were closely related to the Hamming weights of the digits. 

To simulate exponentiation multiplications, another random long integer B' 
was chosen, multiplied by a random member of $\{C^{(1)}, \ldots, C^{(m-1)}\}$, and the 
trace $tr_{s'}$ of gate switch counts created. Like $tr_s$, it was close to the true average 
gate switch counts for whichever $C^{(i)}$ had been selected. The trace was matched 
up with the traces $tr_i$ of each $C^{(i)}$. Specifically, the Euclidean distance between 
it and every $tr_i$ was computed, and the closest chosen to predict i. 
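
The matching step can be illustrated with a much cruder model than the gate-level simulation described here. In the Python sketch below, the leakage of one digit product is simply assumed to be the sum of the Hamming weights of its two inputs, a simplification motivated by the near-linearity noted above; the parameter values, the seed and all function names are choices made for the illustration. It requires Python 3.8 or later for `math.dist`.

```python
import math
import random


def digits_of(x, k, n):
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(n)]


def hw(x):
    return bin(x).count("1")


def leak(ai, bj):
    # Assumed leakage model: linear in both Hamming weights (not gate-level).
    return hw(ai) + hw(bj)


def averaged_trace(A, B, k, n):
    """One value per digit a_i: leakage of a_i x b_j averaged over all j."""
    a, b = digits_of(A, k, n), digits_of(B, k, n)
    return [sum(leak(ai, bj) for bj in b) / n for ai in a]


def nearest(trace, templates):
    """Index of the template with the smallest Euclidean distance."""
    return min(templates, key=lambda i: math.dist(trace, templates[i]))


rng = random.Random(2001)
k, n, m = 16, 64, 4                       # 1024-bit operands, base-4 exponent

# Random stand-ins for the pre-computed powers C^(1) .. C^(m-1).
powers = {i: rng.getrandbits(k * n) for i in range(1, m)}
B = rng.getrandbits(k * n)
templates = {i: averaged_trace(powers[i], B, k, n) for i in powers}

# A later multiplication by an unknown table entry and a fresh B'.
secret_digit = rng.choice(list(powers))
B_prime = rng.getrandbits(k * n)
observed = averaged_trace(powers[secret_digit], B_prime, k, n)
guess = nearest(observed, templates)
```

In this model the averaged contribution of B only shifts every entry of a trace by the same small amount, so the distance to the correct template stays far below the distances to the others.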

The attack simply requires this prediction to be correct. For many typical 
values of k, n and m, the attack invariably succeeded. Table 1 gives the means 
and standard deviations for i) the distances between $tr_{s'}$ and the correct $tr_i$ and 
ii) the distances between $tr_{s'}$ and the incorrect $tr_i$. The difference between the 
two cases is startlingly large. Table 2 shows low error frequencies even in the 
worst cases, namely for the largest k and smallest n. If the number of bits in the 
modulus length is fixed, then the average distance to the nearest trace increases 
as k increases so that the difference between nearest and non-nearest traces 
decreases. For fixed k, increasing the size of the modulus provides more digits 
over which to average and more elements in the vector, thereby improving the 
ability to determine the multiplier correctly. As one would expect, increasing m 
just makes the nearest trace closer and increases the variance in the distances 
to the rest. 

Table 1. Gate Switch Statistics for 512-bit Modulus with m = 4. 

Multiplier Size   k = 64   k = 32   k = 24   k = 16   k = 8 
Av to nearest       4973     2709     2538     2428    2245 
SD to nearest       2582     1482     1334     1183    1024 
Av to others       17981    24312    19834    23475   19793 
SD to others        1232      513      408      481     217 



Table 2. Gate Switch Statistics for 32-bit multiplier with m = 8.

Modulus Length   256 bits   384 bits   512 bits   768 bits   1024 bits
Av to nearest        1529       2366       3750       4501        6246
SD to nearest         885       1403       2386       2535        3612
Av to others         5890      11753      17896      32594       53070
SD to others         1108       2412       2279       4646        4581
%age errors        0.9284     0.1155     0.2819     0.0000      0.0000



Squares and random products were distinguishable from multiplications by a
$C^{(i)}$ because their traces were not close to any $tr_i$. Indeed, the statistics for each
were similar to the non-nearest table entries. Thus, all long integer multiplicative 
operations, including squares, could normally be correctly distinguished in the 
simulation and hence the secret key recovered. 

5 Distances between Power Traces 

Suppose $tr_{s_1}$ and $tr_{s_2}$ are a pair of power traces constructed as above for the
$s_1$-th and $s_2$-th multiplications of the exponentiation. As the traces are real-valued
functions on the integer subinterval $[0, n-1]$, they represent points in $\mathbb{R}^n$.
Define $d$ to be the Euclidean metric on $\mathbb{R}^n$ and let $d(s_1, s_2)$ be the distance
between the points defined by $tr_{s_1}$ and $tr_{s_2}$. One advantage of such a metric
is that places where the traces differ most contribute much more highly to the
distance between traces than places with the smallest differences. This should
help to emphasise the contribution from parameter $A$, which is approximately
$n$ times the contribution from other parameters. It is important to omit from
this metric the points without a noticeable data-dependent contribution as they
reduce the visibility of the data dependence which needs to be observed.

For equal exponent digits $d_{s_1} = d_{s_2}$ the corresponding multiplications share
the same first argument. Since there are no other strong data dependencies, the
value of $d(s_1, s_2)$ should be small, corresponding purely to noise and variation
from the average of the digits appearing in the other arguments. For different
exponent digits $d_{s_1} \neq d_{s_2}$ the value of $d(s_1, s_2)$ should be noticeably larger
because of the greater dependence on the first arguments, which are different.

According to the simulation, the data-dependent contribution to power
consumption is roughly proportional to the Hamming weight of the arguments. So
we can expect the distance between two traces to be approximately related to
the distance between the vectors consisting of the Hamming weights of the digits
of the multipliers $A$. Since the Hamming weights of digits are distributed
binomially, it is easy to obtain statistics for the random variable associated with the
distance between two such vectors and see that it has very similar behaviour to
that observed in the simulation. Hence this gives an accurate guide to the effect
of changing any parameters and enables accurate error predictions to be made.
In particular, it justifies the observation that distances between pairs of traces
cluster around two points, one of which is 0.
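As a rough worked version of this argument, suppose (as an idealisation for illustration, not a description of the simulator) that each trace element is exactly the Hamming weight of a uniformly random $k$-bit digit, and that two different multipliers have independent digits $H_j$ and $H'_j$. The second moment of the distance then follows directly from the binomial distribution:

```latex
H_j,\, H'_j \sim \mathrm{Bin}\!\left(k, \tfrac12\right), \qquad
\mathbb{E}[H_j] = \tfrac{k}{2}, \qquad
\mathrm{Var}(H_j) = \tfrac{k}{4},
\\[4pt]
\mathbb{E}\!\left[(H_j - H'_j)^2\right]
  = \mathrm{Var}(H_j) + \mathrm{Var}(H'_j) = \tfrac{k}{2},
\qquad
\mathbb{E}\!\left[\sum_{j=0}^{n-1} (H_j - H'_j)^2\right] = \tfrac{nk}{2}.
```

Under this idealisation, traces with the same multiplier have identical weight vectors and distance 0, while traces with different multipliers have root-mean-square distance $\sqrt{nk/2}$. For example, $n = 16$ and $k = 32$ give $\sqrt{256} = 16$ in units of Hamming weight.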



6 Identifying Equal Exponent Digits 

Next we present an algorithm for partitioning the set $T = \{0, 1, 2, \ldots, t-1\}$ of
exponent digit indices into subsets for which the corresponding digits of $d$ are the
same. This partition, $p$, has to define $m$ subsets, one for each (exponent) digit
value in base $m$. The subset containing the zero exponent digits should already
have been identified by using the ability to distinguish between (long integer)
squares and multiplies to observe which exponent digits have no corresponding
multiplication in the exponentiation algorithm. For the other digit subsets, the
association of each subset with a particular non-zero base-$m$ digit is performed
in the next section.

The algorithm puts the indices either into a new subset of the partition, or
into the subset of indices which is “nearest” in an obvious sense: the distance
between a single point $s$ and a non-empty set of points $S$ is defined here as
$d(s, \bar{S})$ where $\bar{S}$ is the centroid of $S$, i.e. $\bar{S} = |S|^{-1} \sum_{s' \in S} s'$.
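The point-to-set distance defined above can be written out directly. The sketch below assumes each index is represented by its trace, given as a list of numbers; the function names are illustrative.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(points)
    return [sum(p[j] for p in points) / count for j in range(len(points[0]))]

def distance_to_bucket(trace, bucket_traces):
    """Euclidean distance from `trace` to the centroid of `bucket_traces`."""
    centre = centroid(bucket_traces)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(trace, centre)))

# The bucket {(0,0), (2,4)} has centroid (1,2); the point (4,6) lies 5 away.
print(distance_to_bucket([4, 6], [[0, 0], [2, 4]]))
```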

For each pair of (non-zero) exponentiation digits with indices $s_1$ and $s_2$,
arrange the distances $d(s_1, s_2)$ into descending order, and set up $m-1$ buckets to
receive sets of indices, one for each exponent digit value. Then consider the pairs
$(s_1, s_2)$ in order of decreasing distance between their two traces:

i) If both indices are in different buckets, then move to the next pair.

ii) If there is one unassociated index and an empty bucket then place that index
in the empty bucket and again move on to the next pair.

iii) If neither index is associated with a bucket and there are (at least) two empty
buckets, put the indices into separate empty buckets, and move to the next pair.

iv) If both indices are in the same bucket, compute the distances of $s_1$ and $s_2$
from the set of indices in each bucket. If both are already in the nearest bucket,
move on to the next pair, but otherwise, move $s_1$ and $s_2$ into their nearest
buckets, moving the nearer one first and recomputing distances before moving
the second. Then move to the next pair.

v) If there is an unassociated index and no empty bucket, then put the new
index in a temporary extra bucket, compute the distances between every pair of
buckets and combine the pair of buckets which are the shortest distance apart
to restore the original number of buckets. Move to the next pair.

vi) If neither index is associated but there is only one empty bucket, compute
the distances from $s_1$ and $s_2$ to each non-empty bucket. If $s_1$ is the nearer to its
nearest bucket, then put $s_1$ into that bucket and $s_2$ into the remaining empty
bucket. Otherwise, put $s_2$ into its nearest bucket and $s_1$ into the empty bucket.
Then move on to the next pair.

vii) If neither index is associated and there are no empty buckets, then perform
(v) for both $s_1$ and $s_2$ individually.

With perfect data, the algorithm should first treat all the pairs $(s_1, s_2)$ which
correspond to different exponent digits and correctly put them into different
buckets or find that they are already in different buckets. Then, from some 
point on, all pairs correspond to equal digits and so the indices should be found 
in the same bucket. The algorithm does not place indices in the same bucket 
until there are no empty buckets left. So it is likely for indices with the same 
exponent digit to be initially spread over several buckets. These buckets then 
need to be coalesced to provide empty buckets for unassociated indices. Process 
(v) does this. Once there are no empty buckets left, then action (iv) is used to 
ensure that the best assignments have been made previously. 
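The coalescing action (v) is the least obvious of the rules, so a small sketch may help. It covers only that one step, not the whole procedure. The text does not say how the distance between two buckets is measured; the sketch assumes the distance between their centroids, and all names are illustrative.

```python
import math
from itertools import combinations

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(points)
    return [sum(p[j] for p in points) / count for j in range(len(points[0]))]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def add_with_merge(buckets, traces, new_index):
    """Step (v): give new_index a temporary bucket, then merge the two
    closest buckets so that the number of buckets is unchanged.

    buckets: list of non-empty lists of indices
    traces:  mapping from index to its trace vector
    Assumption: bucket-to-bucket distance is centroid-to-centroid distance.
    """
    candidate = [list(b) for b in buckets] + [[new_index]]
    centres = [centroid([traces[s] for s in b]) for b in candidate]
    i, j = min(
        combinations(range(len(candidate)), 2),
        key=lambda pair: euclidean(centres[pair[0]], centres[pair[1]]),
    )
    merged = candidate[i] + candidate[j]
    rest = [b for pos, b in enumerate(candidate) if pos not in (i, j)]
    return rest + [merged]

traces = {0: [0, 0], 1: [10, 0], 2: [0.5, 0]}
print(add_with_merge([[0], [1]], traces, 2))
```

In the toy data, index 2 lies next to index 0, so its temporary bucket is merged with the bucket containing 0 and two buckets remain.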

With perfect information, each element of $T$ can be assigned to one of the
partition subsets by calculating at most $m-1$ distances. So fewer than $mt$
distances are required to establish the partition correctly if all distances are clearly
and correctly distinguished as small or not. Hence, with up to $t(t-1)/2$ pairs in
total, there is considerable extra information to improve and confirm the
construction of $p$ as it progresses. However, in case of error, all assignments can be
ranked using distances to buckets, and the most likely tried first for correctness.
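To give a feel for the size of this surplus, take the illustrative values $m = 4$ and $t = 512$ (a 1024-bit exponent split into 2-bit digits; these numbers are chosen for the example only):

```latex
(m-1)\,t = 3 \cdot 512 = 1536 \;<\; mt = 2048,
\qquad
\frac{t(t-1)}{2} = \frac{512 \cdot 511}{2} = 130816,
\qquad
\frac{130816}{2048} \approx 64 .
```

So in this case roughly 64 times as many pairwise distances are available as the bound $mt$ on the number needed with perfect information.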

7 Associating Digit Values with Exponent Positions 

The partition $p$ yields $(m-1)!$ possibilities for the key $d$, corresponding to
the possible associations¹ of non-zero digits from 1 to $m-1$ with the $m-1$
different non-zero equivalence classes for the induced equivalence relation on
$T = \{0, 1, 2, \ldots, t-1\}$. However, the pre-computation of the powers $C^{(i)}$ for
$i = 1, 2, \ldots, m-1$ means that we have a known multiplication involving $C^{(i)}$
for each exponent digit except 0 and $m-1$, namely $C^{(i+1)} = C^{(i)} \times C \bmod M$ in
the case of $m$-ary exponentiation and $C^{(i+2)} = C^{(i)} \times C^{(2)} \bmod M$ in the case of
sliding windows.

¹ We have not assumed any knowledge of the modulus $M$. However, as Adi Shamir
pointed out during the presentation, if $M$ and $e$ are known, then in this section one
can probably make the correct association by using the fact that the bits of the top
half of the exponent coincide with those of a small multiple of $M$.

Following the algorithm of the previous section, each trace $tr_i$ corresponding
to the pre-computational multiplication with first argument $A = C^{(i)}$ is
associated with its nearest bucket of exponent digit indices. This bucket is then
labelled “$i$” and should correspond to exponent digit $i$. Ideally, this should not
associate two labels with one bucket, and should leave one bucket unlabelled.
This last bucket is labelled with the remaining exponent digit, namely $m-1$.

If inconsistencies arise from this labelling, then it is easy to rank each possible
labelling using distances from each $tr_i$ to each bucket. Each labelling can be tried
in turn until overall consistency is achieved. As the $m$-ary method uses significant
memory when used in an embedded cryptographic device, $m$ is usually very
small. So all $(m-1)!$ possibilities could be tested for correctness if necessary.
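One way to rank the $(m-1)!$ labellings is sketched below. The cost function (the sum of distances from each available pre-computation trace to the centroid of the bucket given its label) is an assumption made for this example; the text only says that labellings can be ranked using these distances.

```python
import math
from itertools import permutations

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def rank_labellings(precomp_traces, bucket_centroids):
    """Return (cost, labelling) pairs sorted by increasing cost.

    precomp_traces:   dict mapping digit i to its trace tr_i; digits that
                      have no pre-computation trace are simply absent
    bucket_centroids: list of m-1 centroid vectors
    labelling[b] is the digit assigned to bucket b. A digit without a
    trace contributes nothing to the cost (an assumption of this sketch).
    """
    digits = range(1, len(bucket_centroids) + 1)
    ranked = []
    for labelling in permutations(digits):
        cost = sum(
            euclidean(precomp_traces[digit], bucket_centroids[b])
            for b, digit in enumerate(labelling)
            if digit in precomp_traces
        )
        ranked.append((cost, labelling))
    ranked.sort()
    return ranked

# m = 4: traces exist for digits 1 and 2 only; digit 3 takes what is left.
precomp = {1: [0, 0], 2: [5, 5]}
centroids = [[5, 4], [9, 9], [0, 1]]
print(rank_labellings(precomp, centroids)[0])
```

Trying the labellings in this order tests the most plausible association first, which is practical because $m$ is small.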

The trace-averaging process depends on the randomness of the B input and 
its independence from the A input in order to obtain a result which characterises 
the A input. During the pre-computations, both inputs are powers of the initial 
text $C$ and therefore not independent of each other. However, since 3 is generally
regarded as an acceptable encryption exponent, we can assume that the powers
$C^{(i)}$ are sufficiently independent of $C$ when $i$ contains an odd divisor. Then the
traces tri should be acceptable for every i which is not a power of 2. Assuming 
also that problems with powers of 2 decrease as the power increases, only traces 
for the exponent digits 1 and 2 might display dependency problems. 

For digit 1, the power trace for $C^{(2)} = C \times C \bmod M$ depends on both
arguments. We present two solutions to this. First, one can expect to identify which
subtraces correspond to the digit products $a \times a$. They can be excluded from
the averaged trace to obtain a new trace which at each point depends on a single
digit of $C$ and some other effectively independent, random digits. Such a
revised trace behaves like the other averaged trace functions. Alternatively, we
may assume $m > 2$ since if $m = 2$ there is nothing to decide: all the non-zero
exponent digits must be 1. Each product $C^{(i+1)} = C^{(i)} \times C \bmod M$ involves $C$ as
the second argument rather than the first. Thus, for any one of these
multiplications, one can average the traces in a different way, this time summing over the
different first digits while the second is kept fixed, rather than vice versa. Then
for $m > 2$ the last such multiplication gives an alternative to the initial squaring
used in the first method for providing a trace for $C$. If $m > 4$ then the remarks
above about 3 as an encryption exponent establish that the two arguments are
effectively independent when the last multiplication is used for a trace for $C$.
However, if $m = 4$ then this multiplication is the product of $C$ and $C^{(2)}$ and
there may be cause for concern. We remark on this potential problem next, but
otherwise it is reasonable to assume that a typical trace can be obtained for the
class of the exponent digit 1 from the pre-computation multiplications.

The trace associated with digit 2 is derived from the product of $C^{(2)}$ and $C$.
Using the second alternative above, this may also be the source of the trace
associated with digit 1. However, the dependence between these arguments should
be very weak since an essentially random multiple of $M$ has been subtracted
from $C^2$ to obtain $C^{(2)}$. So a usable trace should also be obtained for digit 2.

8 Big Mac 

By omitting the cross-checking afforded by comparing multiplications of the
exponentiation, we obtain the Big Mac attack in which exponent digits are
determined independently, as in the simulation section. Each trace $tr_s$ from a
multiplication in the exponentiation is compared with each trace $tr_i$ from the
pre-computations and the nearest is selected to determine the exponent digit
at position $s$. When no pre-computation trace is close to $tr_s$ then digit $m-1$
(for which there is no pre-computation trace) is assigned. All $t$ exponent digits
can then be recovered in $t$ times the time required for recovering one digit.
Moreover, apart from pre-computations, only the power trace for a single
multiplication is used to recover a single exponent digit. So $t$ times the data, i.e. the
whole exponentiation record, is required to recover all digits.
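The digit-by-digit decision can be sketched as below. The `threshold` used to decide that no pre-computation trace is close is an assumption of this example; the text does not give a rule for choosing it.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def big_mac_digits(exp_traces, precomp_traces, m, threshold):
    """Assign a digit to every multiplication trace independently.

    exp_traces:     list of traces tr_s, one per multiplication
    precomp_traces: dict mapping digit i to its pre-computation trace tr_i
    threshold:      hypothetical cut-off; if even the nearest tr_i is
                    farther away than this, digit m-1 is assigned
    """
    digits = []
    for tr_s in exp_traces:
        digit, dist = min(
            ((i, euclidean(tr_s, tr_i)) for i, tr_i in precomp_traces.items()),
            key=lambda pair: pair[1],
        )
        digits.append(digit if dist <= threshold else m - 1)
    return digits

precomp = {1: [0, 0], 2: [10, 10]}
observed = [[0, 1], [9, 10], [30, 30]]
print(big_mac_digits(observed, precomp, m=4, threshold=3))
```

Each trace is handled on its own, which is why the work grows only linearly with the number $t$ of exponent digits.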

More precisely, suppose $k$ and $m$ are fixed and, as usual, $t \approx nk/\log_2 m$.
We are interested in what happens when the bit length $nk$ of the arguments is
varied. For each long integer multiplication the number of $k$-bit multiplications is
$O(n^2)$. But, for a common level of accuracy, all averaged traces could be compiled
from a fixed number of these digit-by-digit multiplications which is independent
of $t$. This would use only constant data per exponent digit and consequently
$O(t)$ data for the whole attack. If the full quantity of data is used, the traces
become more accurate as $t$ (or $nk$) increases. Furthermore, if every pair $(s_1, s_2)$ is
considered, then more cross-checking is possible as $t$ increases. Hence, the attack
becomes much more viable for larger keys!

9 Using a Set of Exponentiations 

The method of attack described so far has been developed from the power trace 
associated with a single exponentiation. It depended on a reasonable separation 
between the powers of the initial input C when measured using the Euclidean 
metric on the associated vectors of digit Hamming weights. If any powers of 
C are too closely related the attack may fail to work. However, one could wait 
patiently for an input C where the Hamming weights of the pre-computed powers 
are sufficiently widely spread. For large n with small m, this should not take long. 

To benefit from traces from a set of exponentiations, it is important not to 
average the traces. Instead, if the exponent is the same in each case, the sub- 
traces for each multiplication need to be concatenated to provide longer vectors 
for comparison. Alternatively, an observation matrix can be constructed with a 
row for each exponentiation and a column for each exponent digit index, and 
containing the best estimate for the exponent digit. Repeated use of the same 
digits at the same exponentiation points then leads to corresponding correlations
between columns of this matrix. Standard statistical techniques should then 
reveal the exponent. 
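The concatenation option can be sketched as follows. The nested-list layout `runs[e][s]` (sub-trace of multiplication `s` in exponentiation `e`) is an assumption made for the example.

```python
import math

def concatenate_subtraces(runs, s):
    """Join the sub-traces of multiplication s from every exponentiation.

    runs[e][s] is the sub-trace (a list of numbers) recorded for the s-th
    multiplication of the e-th exponentiation; all runs are assumed to use
    the same exponent, so position s has the same digit in each run.
    """
    joined = []
    for run in runs:
        joined.extend(run[s])
    return joined

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two exponentiations, three multiplications each. Positions 0 and 1 share
# a multiplier in both runs; position 2 uses a different one.
runs = [
    [[1, 2], [1, 2], [5, 5]],
    [[3, 0], [3, 1], [0, 0]],
]
v0, v1, v2 = (concatenate_subtraces(runs, s) for s in range(3))
print(euclidean(v0, v1), euclidean(v0, v2))
```

Because squared Euclidean distances add across the concatenated segments, evidence from separate exponentiations accumulates without the traces being averaged.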



10 Some Final Details 

10.1 Separating Squares and Multiplies 

Finally, we consider some details which, for the sake of simplicity, were left out
of the above arguments. The first concerns differentiating squares from
multiplies. The simulation section noted that squares behaved like multiplications
by $C^{(m-1)}$ in having no nearest multiplier. Therefore using distances from
pre-multiplication traces to classify all long integer operations will place both these
types in the same bucket. Since each multiplication must be preceded and
followed by $r$ squarings, the determination of which is which should be
straightforward. Moreover, the multiplications by $C^{(m-1)}$ should all be close to each
other, whilst the squarings should not. Indeed, this also enables the case $m = 2$
to be cracked. Thus, if the attack separates the different multipliers, it certainly
also separates the squares.

10.2 Initial Exponent Digits 

The next omission relates to the initial few multiplications of the exponentiation
after any pre-computation has taken place. The first value assigned to $P$ in
the exponentiation algorithm of §2.1 corresponds to the first (non-zero) digit
of $d$ and involves no multiplication. Hence the method here appears to yield
no information about it. Thus there may be $m-1$ times more possibilities for
$d$ than estimated above, one for each choice of the first non-zero digit of $d$.
This is followed by $r$ squarings. The first is of this initial value of $P$. However,
a trace for it can be extracted in the same way as described in §7 for obtaining
a trace of $C$ from computing $C^{(2)}$. This should reveal the first digit using the
usual nearest bucket method. Once the multiplications for $P$ do start, the $B$
argument of the modular multiplication is generally no longer sufficiently closely
related to influence the power trace adversely. The attack will therefore work
successfully from this point on. The only noticeable exception is the first
multiplication (as opposed to a squaring) when $m = 2$ and the second digit of $d$ is 1.



10.3 Zero Multiplier Digits 

The last concern is if zero digits (base $r$) occur in the inputs to a modular
multiplication and optimization causes the associated digit multiplications to
be skipped. To avoid timing attacks, this should probably not occur. However,
with typical values such as $r \approx 2^{32}$, $n \approx 2^5$, $m = 4$ and $t \approx 2^9$ for 1024-bit
keys, we have about $mn = 2^7$ digits among the pre-computed powers, and about
$nt(m-1)/m = 1.5 \times 2^{13}$ digits among the arguments $B$ of the multiplications
during an exponentiation. So the chances of encountering a digit 0 are small
($\approx nt/r$). In the unlikely event of a zero, the analysis should become much
easier. If the zero digit lies in a pre-computed power, timing analysis immediately
reveals which multiplications use that power. Otherwise the zero digit occurs in
the $B$ argument of a multiplication and one simply defines the trace by averaging
the traces over the non-zero digits of $B$. At worst, another decryption trace might
be obtained to avoid the problem altogether.
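As a check on the orders of magnitude, assume for illustration a 1024-bit key processed with 32-bit digits and 2-bit exponent digits, i.e. $r = 2^{32}$, $n = 2^5$, $m = 4$ and $t = 2^9$ (these specific values are an assumption of this example):

```latex
mn = 2^{2} \cdot 2^{5} = 2^{7},
\qquad
\frac{nt(m-1)}{m} = 2^{5} \cdot 2^{9} \cdot \tfrac{3}{4} = 1.5 \times 2^{13},
\qquad
\frac{nt}{r} = \frac{2^{14}}{2^{32}} = 2^{-18} \approx 3.8 \times 10^{-6}.
```

With these values a zero digit would be expected in only a few exponentiations per million.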

10.4 Chinese Remainder Theorem 

Implementations using the Chinese Remainder Theorem can be attacked in the 
same way because having a single digit multiplier forces the two exponentiations 
to be performed sequentially. The two exponents are then recovered one after 
the other in the way described above, yielding the secret key. 

11 Conclusion 

An unknown plaintext DPA attack on a single RSA exponentiation has been 
described where the implementation uses a single $k$-bit multiplier. This may well
prove successful, particularly against a RISC processor where no other operations 
can be carried out to mask the multiplier’s use of power. The attack becomes 
easier to perform accurately as the key length is increased because more useful 
data is available. For fixed k and using all available data, the running time is 
proportional to the key length cubed. 

The attacker waits for a sufficiently helpful exponentiation, and then uses 
a careful and novel selection and combination of sections from a single power 
trace to recover secret decryption keys. If the same exponent is reused the attack 
becomes easier. Blinding keys is no defence if the attack succeeds on a single 
exponentiation. Then other methods are required. One solution might be to keep 
a processor/co-processor architecture where the two processes mask each other. 
Alternatively, a pipelined $k$-bit multiplier with several stages might be used,
or CRT performed with the exponentiations using two separate multipliers in 
parallel. Yet another solution might be to use a systolic modular multiplier [15] 
where many unrelated digit multiplications are computed in parallel. 

Certainly one concludes that performing a single, digit-level operation at one 
time, such as a multiplication, leads to a potentially unsafe implementation of 
the RSA cryptosystem. 
Abstract. Very few countermeasures are known to protect an exponentiation
against simple side-channel analyses. Moreover, all of them are
heuristic.

This paper presents a universal exponentiation algorithm. By tying the 
exponent to a corresponding addition chain, our algorithm can virtually 
execute any exponentiation method. 

Our aim is to transfer the security of the exponentiation method be- 
ing implemented to the exponent itself. As a result, we hopefully tend 
to reconcile the provable security notions of modern cryptography with 
real-world implementations of exponentiation-based cryptosystems. 

Keywords. Implementation, exponentiation, RSA cryptosystem, discrete 
logarithm, side-channel attacks, simple power analysis (spa), addition 
chains, provable security, smart-cards. 



1 Introduction 

The security of a cryptosystem is evaluated as the latter’s ability to resist at- 
tacks in a given adversarial model. It is very challenging to guess the strategy 
the adversary will follow in an attempt to break the system. So, the only as- 
sumptions made by modern cryptography refer to the computational abilities of 
the adversary [6]. Loosely speaking, a cryptosystem is then said to be secure if there
is no polynomial-time adversary able to gain more “useful” information than an
honest user by deviating from the “prescribed” behavior.

In [9,11], Kocher et al. launched a new class of attacks: the so-called side- 
channel attacks. In such a scenario, an adversary monitors some side-channel in- 
formation (e.g., power consumption) during the execution of a crypto-algorithm 
and thereby may foil the security of the corresponding “provably secure” cryp- 
tosystem. So what does provable security mean? The security is usually proven 
by reduction: one shows that the only way to break the cryptosystem is to break 
the underlying cryptographic primitive (e.g., the RSA function). Since this is 
assumed to be computationally infeasible, the cryptosystem is declared secure. 
A side-channel attack does not violate this assumption, it just considers other 
directions to break the cryptographic primitive. Consequently, we stress that the 
notions of provable security, or more exactly provable computational security, are
very useful and must be part of the analysis of any cryptosystem.

Unfortunately, there is no counterpart to side-channel attacks. Defining a 
security model for this class of attacks seems unrealistic since we do not see how 
to limit the power of the adversary. The best we can hope to prove is the security 
relative to one particular attack. 

This paper focuses on modular exponentiation (e.g., the RSA function or the
discrete logarithm function) as a cryptographic primitive. Using a representation
with addition chains, we “transfer” the security of the exponentiation method
actually implemented to the exponent itself (which is the secret data). The resulting
algorithm, which we call universal exponentiation algorithm, works with virtually
all exponentiation methods. It simply reads triplets of values $(\gamma(i) : \alpha(i), \beta(i))$,
meaning that the content of register $R[\alpha(i)]$ must be multiplied by the content
of register $R[\beta(i)]$ and that the result must be written into register $R[\gamma(i)]$. We
provide in this way a kind of reduction. Instead of carefully analyzing a specific
exponentiation method, the implementor simply verifies that the atomic operation
$R[\gamma(i)] \leftarrow R[\alpha(i)] \cdot R[\beta(i)]$ does not leak any “useful” information through
a given side-channel attack. This methodology is reminiscent of the traditional
security proofs. In the traditional case, the security of a cryptographic primitive
is conjectured (e.g., inverting the RSA function is infeasible) whereas in our case
the security of an atomic operation is assessed through experiments (e.g., I cannot
“break” a multiplication by SPA). The main difference is that the security
assumption is scrutinized by fewer people and hence is more controversial.

The rest of this paper is organized as follows. The next section recalls the 
definition of an addition chain. Based on it, we then present our universal expo- 
nentiation algorithm. In Section 3, we discuss the merits of our approach from a 
security viewpoint. Section 4 suggests some modifications to our basic algorithm. 
Finally, we conclude in Section 5. 

2 Universal Exponentiation Algorithm 

2.1 Addition Chains 

We start with a brief introduction to addition chains. For further details, we refer
the reader to [8].

Definition 1. An addition chain for a positive integer $d$ is a sequence
$C(d) = \{d^{(0)}, d^{(1)}, \ldots, d^{(\ell)}\}$ satisfying

1. $d^{(0)} = 1$, $d^{(\ell)} = d$, and

2. for all $1 \le i \le \ell$, there exist $j(i), k(i) < i$ such that $d^{(i)} = d^{(j(i))} + d^{(k(i))}$.

Integer $\ell$ defines the length of chain $C$. An addition chain is called a star-chain
if for all $1 \le i \le \ell$ there exists $k(i) < i$ such that $d^{(i)} = d^{(i-1)} + d^{(k(i))}$.

A slightly more general notion is that of addition-subtraction chains.

Definition 2. An addition-subtraction chain for an integer $d$ is a sequence
$C(d) = \{d^{(0)}, d^{(1)}, \ldots, d^{(\ell)}\}$ satisfying

1. $d^{(0)} = 1$, $d^{(\ell)} = d$, and

2. for all $1 \le i \le \ell$, there exist $j(i), k(i) < i$ such that $d^{(i)} = d^{(j(i))} \pm d^{(k(i))}$.
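Definition 1 translates directly into a small checker. The sketch below represents a chain as a Python list starting at $d^{(0)}$; the function names are illustrative.

```python
def is_addition_chain(chain, d):
    """Check Definition 1: chain[0] == 1, chain[-1] == d, and every later
    element is the sum of two (not necessarily distinct) earlier elements."""
    if not chain or chain[0] != 1 or chain[-1] != d:
        return False
    for i in range(1, len(chain)):
        earlier = chain[:i]
        if not any(chain[i] - a in earlier for a in earlier):
            return False
    return True

def is_star_chain(chain, d):
    """Check the star-chain condition: each element is the immediately
    preceding element plus some earlier element."""
    if not is_addition_chain(chain, d):
        return False
    return all(chain[i] - chain[i - 1] in chain[:i] for i in range(1, len(chain)))

print(is_addition_chain([1, 2, 3, 5], 5), is_star_chain([1, 2, 3, 5], 5))
print(is_addition_chain([1, 2, 4, 5, 8, 13], 13), is_star_chain([1, 2, 4, 5, 8, 13], 13))
```

The second chain is a valid addition chain but not a star-chain, because 8 = 4 + 4 does not use the preceding element 5.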

2.2 A Universal Algorithm 

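Let $C(d) = \{d^{(0)}, d^{(1)}, \ldots, d^{(\ell)}\}$ be an addition chain for exponent $d$. So for all
$1 \le i \le \ell$, we have $d^{(i)} = d^{(j(i))} + d^{(k(i))}$. This provides an easy means to evaluate
$y = x^d$: For $i = 1$ to $\ell$ compute

$$x^{d^{(i)}} = x^{d^{(j(i))}} \cdot x^{d^{(k(i))}}$$

and then set $y = x^{d^{(\ell)}}$. So, from an addition chain of length $\ell$, $\ell$ multiplications
are required to compute $y$.

Example 1. An addition chain for 5 is $C(5) = \{1, 2, 3, 5\}$ and so $x^1 = x$, $x^2 =
x^1 \cdot x^1$, $x^3 = x^2 \cdot x^1$, and finally $x^5 = x^3 \cdot x^2$.

At step $i$, $x^{d^{(i)}}$ is evaluated as $x^{d^{(j(i))}} \cdot x^{d^{(k(i))}}$. Assuming that $x^{d^{(j(i))}}$
and $x^{d^{(k(i))}}$ respectively belong to registers $R[\alpha(i)]$ and $R[\beta(i)]$ and that the
result, $x^{d^{(i)}}$, is written in register $R[\gamma(i)]$, exponent $d$ can be represented by the
register sequence

$$\Gamma(d) = \{(\gamma(i) : \alpha(i), \beta(i))\}_{1 \le i \le \ell} , \qquad (1)$$

meaning that $R[\gamma(i)] = R[\alpha(i)] \cdot R[\beta(i)]$. (By convention, the value $d = 1$ is
represented by $\Gamma(1) = \emptyset$.)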

From this, we obtain the following exponentiation algorithm (for d > 1): 



Input: $x$, $\Gamma(d)$

Output: $y = x^d$

$R[\alpha(1)] \leftarrow x$; $R[\beta(1)] \leftarrow x$
for $i = 1$ to $\ell$ do
    $R[\gamma(i)] \leftarrow R[\alpha(i)] \cdot R[\beta(i)]$
return $R[\gamma(\ell)]$

Algorithm 1. Universal exponentiation algorithm.



Note that $R[\alpha(1)]$ and $R[\beta(1)]$ are initialized to $x$ because the second item of
each addition chain is always $d^{(1)} = 2$. Note also that one may have $\alpha(1) = \beta(1)$.
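A minimal Python rendering of Algorithm 1 is given below. Two choices are assumptions of this sketch: the multiplication is taken modulo an integer `modulus` (as in RSA), and the register sequence is passed as a list of `(gamma, alpha, beta)` tuples with registers held in a dictionary.

```python
def universal_exp(x, triplets, modulus):
    """Evaluate x**d mod modulus from a register sequence for d.

    triplets: list of (gamma, alpha, beta) register indices, meaning
              R[gamma] <- R[alpha] * R[beta]; empty for d = 1.
    """
    if not triplets:          # convention: the empty sequence encodes d = 1
        return x % modulus
    registers = {}
    _, alpha1, beta1 = triplets[0]
    registers[alpha1] = x % modulus
    registers[beta1] = x % modulus
    for gamma, alpha, beta in triplets:
        registers[gamma] = (registers[alpha] * registers[beta]) % modulus
    return registers[triplets[-1][0]]

# Chain {1, 2, 3, 5} on three registers:
#   R1 <- R0*R0 (x^2), R2 <- R1*R0 (x^3), R0 <- R2*R1 (x^5)
sequence_for_5 = [(1, 0, 0), (2, 1, 0), (0, 2, 1)]
print(universal_exp(3, sequence_for_5, 1000))
```

The loop body is the same single multiplication whatever exponentiation method produced the triplets, which is the property the paper relies on.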

For star chains, we have $d^{(i)} = d^{(i-1)} + d^{(k(i))}$. Therefore pairs are sufficient
to represent $d$: $\alpha(i) = \gamma(i-1)$ for all $1 \le i \le \ell$ and can be omitted from the
representation. Hence, we have the star register sequence

$$\Gamma^*(d) = \{(\gamma(i) : \beta(i))\}_{1 \le i \le \ell} . \qquad (2)$$

The corresponding exponentiation algorithm is: 
Input: $x$, $\Gamma^*(d)$

Output: $y = x^d$

$R[\gamma(0)] \leftarrow x$; $R[\beta(1)] \leftarrow x$
for $i = 1$ to $\ell$ do
    $R[\gamma(i)] \leftarrow R[\gamma(i-1)] \cdot R[\beta(i)]$
return $R[\gamma(\ell)]$

Algorithm 2. Universal star exponentiation algorithm.
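Algorithm 2 can be sketched in the same style. As assumptions of this example, the initial register index $\gamma(0)$ is passed separately as `gamma0` (it is not part of the pairs indexed from 1), and multiplication is modular.

```python
def universal_star_exp(x, gamma0, pairs, modulus):
    """Evaluate x**d mod modulus from a star register sequence for d.

    gamma0: index of the register that initially holds x
    pairs:  list of (gamma, beta) indices, meaning
            R[gamma(i)] <- R[gamma(i-1)] * R[beta(i)]
    """
    if not pairs:             # empty sequence encodes d = 1
        return x % modulus
    registers = {gamma0: x % modulus, pairs[0][1]: x % modulus}
    previous = gamma0
    for gamma, beta in pairs:
        registers[gamma] = (registers[previous] * registers[beta]) % modulus
        previous = gamma
    return registers[previous]

# Star chain {1, 2, 3, 5}: R1 <- R0*R0, R2 <- R1*R0, R0 <- R2*R1
print(universal_star_exp(3, 0, [(1, 0), (2, 0), (0, 1)], 1000))
```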



3 Towards Provable SPA-Resistance 

The ultimate goal of smart-card manufacturers is a proof that their implementa- 
tions are resistant to side-channel analysis. In this paper, we adopt the method- 
ology of modern cryptography towards this goal. 

Take for example the encryption scheme RSA-OAEP [3]. The minimal security
requirement for an encryption scheme is one-wayness (OW). This captures
the property that an adversary cannot recover the whole plaintext from a given
ciphertext. In some cases, partial information about a plaintext may have
disastrous consequences. This notion is captured by semantic security or the
equivalent notion of indistinguishability [7]. Basically, indistinguishability means that
the only strategy for an adversary to distinguish between the encryptions of any
two plaintexts is to guess at random. The strongest attacks one can imagine (at
the protocol level) are the so-called adaptive chosen-ciphertext attacks (CCA2).
Those attacks consider an active adversary who can obtain the decryption of any
ciphertext of her/his choice. From the pair of adversarial goal (IND) and
adversarial model (CCA2), we derive the security notion of IND-CCA2. In an IND-CCA2
scenario, an adversary has access to a decryption oracle. S/he first outputs a pair
of plaintexts $m_0$ and $m_1$. Then, given a challenge ciphertext $c_b$ which is either
the encryption of $m_0$ or $m_1$, the adversary has to guess with a probability
non-negligibly better than 1/2 whether $c_b$ encrypts $m_0$ or $m_1$. The attack is called
adaptive if, after receiving the challenge $c_b$, the adversary may still obtain decryptions of
chosen ciphertexts, the only restriction being not to probe on $c_b$.

In [3], Bellare and Rogaway remarkably proved that if an adversary is able
to break the IND-CCA2 security of RSA-OAEP then the same adversary is able
to break the OW security of the RSA function, that is, to compute an $e$-th root
modulo a large composite number $N = pq$ (where typically $p$ and $q$ are 512-bit
primes). Since the latter is assumed infeasible, RSA-OAEP is declared provably
secure. We note that their proof only holds in the random oracle model [2], i.e., an
ideal world where hash functions behave like random functions. To summarize,
the security of RSA-OAEP is proven by

1. identifying the security goal and the adversarial model (i.e., IND-CCA2);

2. defining the working hypotheses (i.e., random oracle model);

3. exhibiting a reduction (i.e., breaking the IND-CCA2 of RSA-OAEP $\Rightarrow$ breaking
the OW of the RSA function);

4. assuming that the reduced problem is intractable (i.e., inverting RSA is
infeasible);

5. deducing the security notion (i.e., IND-CCA2 security of RSA-OAEP in the
random oracle model).

The security of RSA-OAEP is at the protocol level. To break the IND-CCA2
security, the adversary has black-box access to a decryption oracle: s/he knows
the input and obtains the corresponding output. In the case of side-channel 
attacks, the adversary is more powerful: s/he gets access to some internal states 
of the computation. 

So, by monitoring the power consumption of an RSA exponentiation, an
attacker is sometimes even able to recover the secret decryption exponent $d$ used
in the computation of $y = x^d \bmod N$ and the OW assumption of the RSA
function is no longer valid. Suppose for example that the RSA function is naively
implemented with the square-and-multiply method. As shown in the next
figure, the exponent can then be recovered very easily: a lower consumption level
corresponds to a squaring and a higher consumption level corresponds to a
multiplication.




Fig. 1. Power trace of a square-and-multiply exponentiation. 
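The leak can be imitated in software by recording the operation sequence instead of a power level. The sketch below assumes the left-to-right binary method and an idealised observer who can tell every squaring ("S") from every multiplication ("M"); the function names are illustrative.

```python
def square_and_multiply_ops(x, d, n):
    """Left-to-right square-and-multiply for d >= 1, also returning the
    sequence of operations an SPA observer is assumed to see."""
    bits = bin(d)[2:]
    y = x % n
    ops = []
    for bit in bits[1:]:          # the leading 1 bit needs no operation
        y = (y * y) % n
        ops.append("S")
        if bit == "1":
            y = (y * x) % n
            ops.append("M")
    return y, "".join(ops)

def recover_exponent(ops):
    """Rebuild d from the operation string: 'SM' encodes a 1 bit and a
    lone 'S' encodes a 0 bit, after the implicit leading 1."""
    bits = "1"
    i = 0
    while i < len(ops):
        if i + 1 < len(ops) and ops[i + 1] == "M":
            bits += "1"
            i += 2
        else:
            bits += "0"
            i += 1
    return int(bits, 2)

y, ops = square_and_multiply_ops(7, 45, 1000)
print(ops, recover_exponent(ops))
```

The observer needs no knowledge of $x$ or $N$; the pattern of operations alone determines every bit of the exponent.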



In our simplified model, we consider the fundamental security goal of
unbreakability (UB). A cryptosystem is said to be unbreakable if it is infeasible to recover
the secret key. This kind of attack is usually referred to as a total break.
We also consider an attacker who has access to some side-channel information.
Depending on the side-channel information and the way it is treated, we
define several adversarial models. In the simple power analysis (SPA) model, an
attacker acquires the power trace of a single execution of the crypto-algorithm.
From this, we derive the security notion of UB-SPA. Likewise, one can define the
UB-DPA (DPA stands for differential power analysis [11]) and so on; one can also
consider other security goals and derive security notions like OW-SPA or IND-SPA.
It is worth noting here that, contrary to modern cryptography, the definition of
an adversarial model is not absolute: in a CCA2 attack, an adversary obtains
the plaintext corresponding to a chosen ciphertext whereas in an attack like
SPA, the “quality” of the returned information depends on the acquisition tools
among other things.

Concentrating on the exponentiation function and more particularly on the
RSA function, one can show that if an adversary is able to break the UB-SPA
security of the universal exponentiation algorithm, s/he is also able to invert the
RSA function. (We note that the main threat for an RSA exponentiation is
SPA; for DPA, efficient countermeasures are known.) In order to break UB-SPA,
an adversary must be able to gain some secret information from the basic
operation $R[\gamma(i)] \leftarrow R[\alpha(i)] \cdot R[\beta(i)]$ by SPA, that is,
s/he must be able to, at least, differentiate among the triplets
$(\gamma(i) : \alpha(i), \beta(i))$ and to recover all their values to break the
UB property. Assuming that the latter is infeasible (this can be verified
experimentally), one has strong evidence^ that the universal exponentiation
algorithm resists SPA. As a conclusion, if rsa-OAEP is implemented with the
universal exponentiation algorithm, we have strong evidence that it resists SPA
attacks. Note here that the security is assessed at the implementation level.

From a security viewpoint, the advantage of our method is evident. It reduces
the problem of scrutinizing any exponentiation algorithm to that of the simpler
operation $R[\gamma(i)] \leftarrow R[\alpha(i)] \cdot R[\beta(i)]$. This makes the
job of the implementor a lot easier since s/he has a better knowledge of the
sensitive parts of her/his algorithm. Moreover, the security passes from a
macroscopic level (a software exponentiation) to a microscopic level (a hardware
multiplication). Finally, the analysis must be done only once and remains valid
whatever the exponentiation algorithm underlying a given $\Gamma$-representation.
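To make the role of the basic operation concrete, here is a minimal Python
sketch of an interpreter for a register sequence. It is only an illustration
under stated assumptions, not the paper's Algorithm 1: it assumes that every
register initially holds $x$ and that the result is read from the register
written by the last triplet. The name `universal_exp` and the example sequence
are hypothetical.

```python
def universal_exp(x, triplets, n, num_registers):
    """Run a register sequence; each (g, a, b) performs R[g] <- R[a] * R[b] mod n."""
    # Assumption: all registers start out holding x.
    regs = [x % n] * num_registers
    for g, a, b in triplets:
        # The only operation ever executed, whatever exponent is encoded.
        regs[g] = (regs[a] * regs[b]) % n
    # Assumption: the result sits in the register written by the last triplet.
    return regs[triplets[-1][0]]


# Hypothetical sequence for d = 5 built from the addition chain 1, 2, 4, 5:
# R1 <- R0*R0 (x^2), R1 <- R1*R1 (x^4), R1 <- R1*R0 (x^5).
SEQ_FOR_5 = [(1, 0, 0), (1, 1, 1), (1, 1, 0)]
```

The loop body is identical for every triplet, which is the property the security
argument relies on: only the register indices depend on the secret exponent.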

Remark 1. To relax the assumption that the universal exponentiation
algorithm is UB-SPA, one can always randomly add dummy operations at the
expense of a longer running time (e.g., add to a $\Gamma$-representation a triplet
that does not affect the final result). One can also exploit the property that
$R[\gamma(i)] \leftarrow R[\alpha(i)] \cdot R[\beta(i)]$ and
$R[\gamma(i)] \leftarrow R[\beta(i)] \cdot R[\alpha(i)]$ both lead to the same
result. Another solution consists in randomly permuting the order of the registers
and their values during the course of the exponentiation.

In addition to simplifying the security analysis, our universal exponentiation 
algorithm has the following features: 

— it is simple: its implementation is straightforward and so programming errors
are likely to be avoided;

— it is flexible: owing to the genericity of the $\Gamma$-representation, it can
execute virtually all exponentiation algorithms;

— it is fast: contrary to the protected square-and-multiply method (a.k.a.
square-and-multiply-always method), which requires $2 \log_2 d$ multiplications
for computing $y = x^d$, our algorithm may require as few as $1.25 \log_2 d$
multiplications (cf. § 4.1);

— it is economic: if the exponentiation algorithm underlying a
$\Gamma$-representation happens to be flawed, it is enough to correct the
$\Gamma$-representation: a complete re-programming is unnecessary.

^ In contrast with modern cryptography, we cannot say that we have a proof of security
because, as mentioned above, this depends on the quality of the experiments.

The last property is especially interesting for a smart-card implementation.
The program code is usually stored in ROM via an expensive process
called masking and the secret key (e.g., the RSA decryption exponent $d$) is stored
in EEPROM at the personalization stage. So in case of secret leakage or
mis-programming, one has just to change or correct the $\Gamma$-representation of
the secret exponent.

4 Practical Considerations 

If we want to realize a smart-card implementation of the proposed algorithms
(Algorithms 1 and 2), we face some constraints. A smart card has a limited
number of registers and so we need a way to produce $\Gamma$-representations with
a predetermined number of registers. Moreover, a $\Gamma$-representation with fewer
registers requires less memory for its storage. Another difficulty may occur
when the secret exponent is generated outside the card by a third party because
it is given in its binary representation.

In this section, we suggest two different approaches that alleviate the above 
limitations. 

4.1 On-Line Generation 

A straightforward solution is to produce a $\Gamma$-representation on-line, i.e., by
the smart card itself. Several good heuristics are known for producing relatively
short addition chains. In [12], Walter suggests the following method to compute
$y = x^d$ (see also [4]).

Define $d_0 = d$, $x_0 = x$, and $y_0 = 1$. Next, at each step, write
$d_i = m_i d_{i+1} + r_i$ for appropriately chosen values for $(m_i, r_i)$. Hence,
letting $x_{i+1} = x_i^{m_i}$ and $y_{i+1} = x_i^{r_i} y_i$, we get

$$
\begin{aligned}
y = x_0^{d_0} y_0 &= (x_0^{m_0})^{d_1} (x_0^{r_0} y_0) \\
  = x_1^{d_1} y_1 &= (x_1^{m_1})^{d_2} (x_1^{r_1} y_1) \\
  = x_2^{d_2} y_2 &= (x_2^{m_2})^{d_3} (x_2^{r_2} y_2) \\
  = x_3^{d_3} y_3 &= \cdots
\end{aligned}
$$

The idea behind Walter’s method is to find pairs $(m_i, r_i)$ so that the
evaluations of both $x_i^{m_i}$ and $x_i^{r_i}$ are inexpensive. This is the case
when $r_i$ lies in the addition chain used to evaluate $m_i$.
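The recurrence above can be sketched in a few lines of Python. This is only an
illustration of the invariant $x_i^{d_i} y_i = x^d$: the divisor selection rule
`choose_divisor` is a hypothetical placeholder and is not the rule used in [12],
and Python's built-in `pow` stands in for the small addition chains that would
evaluate $x_i^{m_i}$ and $x_i^{r_i}$ on a card.

```python
def choose_divisor(d):
    """Hypothetical selection rule, for illustration only (not Walter's rule)."""
    return 3 if d % 3 == 0 else 2


def division_chain_pow(x, d, n):
    """Compute x^d mod n by repeatedly writing d_i = m_i * d_{i+1} + r_i."""
    y = 1
    while d > 0:
        m = choose_divisor(d)
        r = d % m                      # r_i
        d //= m                        # d_{i+1}
        y = (y * pow(x, r, n)) % n     # y_{i+1} = x_i^{r_i} * y_i
        x = pow(x, m, n)               # x_{i+1} = x_i^{m_i}
    return y                           # once d reaches 0, y holds x^d mod n
```

With the fixed choice $m_i = 2$ the sketch degenerates into the right-to-left
binary method, which shows that the division-chain view generalizes it.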

Such a method is very well suited to a smart-card implementation. It is easy
to implement and the corresponding register sequence, $\Gamma(d)$, requires only
one more register than the standard square-and-multiply method. Furthermore, the
average length of $\Gamma(d)$ is only $1.25 \log_2 d$, with a very small deviation.
See [12] for details.

Note that the computation of $\Gamma(d)$ must be performed in a secure
environment since its disclosure reveals the value of the secret exponent $d$. For
example, this can be performed at the personalization of the card.



4.2 Exponent Splitting 

The second solution we propose relies on the simple observation that

$$x^d = x^a \cdot x^{d-a} \qquad (3)$$

for some $a$. The idea of splitting the data was already abstracted in [5] as a
general countermeasure against differential power analysis attacks. We note that
the values of both $a$ and $(d - a)$ are required to recover the value of $d$. In
other words, only one exponentiation, $x^a$ or $x^{d-a}$, needs to be secured.

Given a register sequence for $a$, $\Gamma(a)$ or $\Gamma^*(a)$, we can compute
$y' = x^a$ and $d' = d - a$, and so $x^d = y' \cdot x^{d'}$. There are two possible
alternatives. The first one is, for a given $a$, to store a chosen (and thus fixed)
register sequence, $\Gamma(a)$, during the personalization of the card. (In this
case a star representation, $\Gamma^*(a)$, may be preferred since it requires less
memory.) The advantage of this approach is that this imposes the underlying
methods for computing $y' = x^a$ and $d' = d - a$.

Another alternative consists in randomly computing a register sequence,
$\Gamma(a)$ or $\Gamma^*(a)$, for $a$ “on the fly”. The advantages of this second
approach are twofold. First, no register sequence needs to be stored in
non-volatile memory and so this results in some memory savings. Second, the
methods for evaluating $y' = x^a$ and $d' = d - a$ differ at each execution.
Independently, this randomization also helps to prevent differential attacks like
DPA.
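A minimal Python sketch of Eq. (3) with a fresh random split at each call is
given below. It only checks the arithmetic identity; it says nothing about how
either partial exponentiation would be protected on a card. The function name
`split_exponentiation` is hypothetical, and Python's built-in `pow` replaces the
register-sequence evaluation.

```python
import secrets


def split_exponentiation(x, d, n):
    """Compute x^d mod n as x^a * x^(d-a) mod n for a freshly drawn a."""
    a = secrets.randbelow(d + 1)       # random a with 0 <= a <= d
    y_prime = pow(x, a, n)             # y' = x^a
    d_prime = d - a                    # d' = d - a
    return (y_prime * pow(x, d_prime, n)) % n
```

Each call uses a different pair $(a, d - a)$, so the sequence of operations
varies from one execution to the next while the result stays the same.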



5 Conclusion 

In this paper, we presented a universal exponentiation algorithm. Through
the notion of register sequence,
$\Gamma(d) = \{(\gamma(i) : \alpha(i), \beta(i))\}$, built from
addition chains, we explained how this helps to protect an exponentiation-based
cryptosystem against simple side-channel attacks like SPA. Assuming that a more
atomic operation (i.e., the multiplication of registers
$R[\gamma(i)] \leftarrow R[\alpha(i)] \cdot R[\beta(i)]$)
does not leak secret information, we “proved” the security of our implementation.
There is no secret at all involved in our universal exponentiation algorithm: the
secret exponent $d$ is intimately tied to $\Gamma(d)$ and recovering the value of
$d$ supposes the recovery of the whole sequence $\Gamma(d)$, which is a
contradiction. Furthermore, our algorithm can be trivially implemented and it
greatly simplifies the security analysis since the critical (i.e., sensitive) parts
are better understood.

As a final conclusion, we hope that this first step towards provable security 
of real-world implementations will be a motivating starting-point for further 
research in this very important subject. 
Abstract. Since Power Analysis on smart cards was introduced by Paul
Kocher [7], many countermeasures have been proposed to protect imple-
mentations of cryptographic algorithms. In this paper we propose a new 
protection principle: the transformed masking method. We apply this 
method to protect two of the most popular block ciphers: DES and the 
AES Rijndael. To this end we introduce some transformed S-boxes for 
DES and a new masking method and its applications to the non-linear 
part of Rijndael. 

Keywords: AES, Rijndael, DES, Transformed mask, Multiplicative mask,
Power analysis, DPA, SPA, Smart Cards.

1 Introduction 

Since Kocher, Jaffe and Jun introduced Differential Power Analysis, many 
countermeasures have been proposed to protect the card against power analysis 
type attacks (SPA, DPA, HODPA): 

— insertion of dummy instructions;

— randomization of operations; 

— transformation of the data (i.e. Duplication Method [6]); 

— masking of the data [2,3]: boolean, arithmetic... 

In this paper we present a practical implementation of DES ([10]) and AES
([5]) using some of these countermeasures combined with new methods. We will
essentially use a new idea (an adapted masking method) combined with a
bit-per-bit randomization of many operations during the computation.



2 Transformed Masking Method 

2.1 Principle 

The idea is the following: the message is masked at the beginning of the
algorithm and after this, everything (or nearly everything) is as usual. Most of
the previously proposed methods must respect a masking condition at each step of
the algorithm, but here we only need to know the value of the mask at a fixed step
(for example at the end of a round or at the end of a non-linear part) and we
reestablish the expected value at the end of the algorithm.

It is easy to see that the problem of implementing a masking countermeasure
comes from the non-linear parts of the algorithm. Furthermore, the security of a
symmetric cryptographic algorithm is essentially based on these parts.

2.2 DES 

In the case of the DES, the most appropriate mask is a boolean mask X 
which is applied before the Initial Permutation IP (we XOR the 64-bit message 
M with a 64-bit value X). The only non-linear part of the DES is the S-Box, 
so when using a masking countermeasure, we use a modified S-Box. This also 
enables us to reestablish the mask value. It is only after the Final Permutation 
FP that the mask will be removed to obtain the right result. 

2.3 AES 

For this algorithm, the method is slightly different. We still use a XOR
operation as a masking countermeasure, but now the mask is arithmetic on
$GF(2^8)$. This operation is compatible with the AES structure except for the
inversion in $GF(2^8)$. For this, we use a new technique to transform the boolean
mask into a multiplicative mask. This allows us to keep the same level of security
throughout the algorithm. As for the DES, the mask will be reestablished at the
end of each round and the value will be unmasked at the end of the algorithm.

3 Applications 
3.1 Securing the DES 

DES Structure We want to cipher a 64-bit message M with the DES. We
choose a 64-bit mask X which will be XOR-ed with the message M at the
beginning of the DES. Then we start with the value $M \oplus X$.

Just before the S-box, it is easy to see (Fig. 1) that we have an intermediary
value masked with $X2 = EP(X1_{32-63})$ where:

— X1 represents the 64-bit value IP(X);

— IP represents the initial permutation;

— $X1_{0-31}$ (respectively $X1_{32-63}$) represents the 32-bit low-weight
(respectively high-weight) part of the 64-bit mask X1;

— X2 represents the 48-bit value $EP(X1_{32-63})$;

— EP represents the expansive permutation of a DES round.

The method chosen in the case of the DES is to reestablish the mask X1 at
each round. To obtain this result, we will use a modified S-box, denoted SM-Box.
The output of the SM-Box, after the permutation P and after being XOR-ed
with the left part of the message, must have a mask corresponding to
$X1_{32-63}$. So, the SM-Box is now defined by:

$$\text{SM-Box}(A) = \text{S-Box}(A \oplus X2) \oplus P^{-1}(X1_{0-31} \oplus X1_{32-63}) \qquad (1)$$

where $P^{-1}$ is the inverse of the permutation P applied after the S-Box.

It is also necessary to modify the left part of the message: we XOR it with
$X1_{0-31} \oplus X1_{32-63}$. So, at the end of the round, the mask X1 will be
preserved.

After the last round, the two 32-bit parts are interchanged, so it will be
necessary to XOR these two parts with the 32-bit value
$X1_{0-31} \oplus X1_{32-63}$ before the final permutation FP. The correct cipher
is obtained after the final permutation by unmasking the value with the 64-bit
mask X.

To sum up, the following scheme represents the differences between a DES
without countermeasure and a DES with masking countermeasure. Operations
added to implement the masking countermeasure are represented with dotted
lines.

Fig. 1. Differences between a DES without countermeasure and a DES with masking
countermeasure.
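The table recomputation behind Eq. (1) can be sketched in Python on a toy
example. The sketch is deliberately not DES: it uses one randomly generated
6-bit to 4-bit table instead of the real S-boxes, and `out_mask` stands for the
4 bits of $P^{-1}(X1_{0-31} \oplus X1_{32-63})$ that line up with that table.
All names and values are hypothetical.

```python
import random


def build_sm_box(s_box, x2, out_mask):
    """SM-Box(A) = S-Box(A xor X2) xor out_mask, tabulated for every masked input A."""
    return [s_box[a ^ x2] ^ out_mask for a in range(len(s_box))]


# Toy parameters, fixed seed so the example is reproducible.
rng = random.Random(2001)
toy_s_box = [rng.randrange(16) for _ in range(64)]   # stand-in for one DES S-box
x2 = rng.randrange(64)          # 6 bits of the input mask X2
out_mask = rng.randrange(16)    # 4 bits of the output correction
sm_box = build_sm_box(toy_s_box, x2, out_mask)

# Looking up a masked input t ^ x2 yields the true output masked by out_mask,
# so the unmasked value t never has to be formed.
masked_outputs = [sm_box[t ^ x2] for t in range(64)]
```

The table depends on the mask, which is why it has to be rebuilt in RAM each
time a new mask is drawn.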



Mask Preparation It is important to note that the preparation of the different
values used as mask during the cipher (X, X1 and X2) must be computed in
a very secure way. Indeed, if an attacker succeeds in finding X, X1 or X2 then
he will be able to break our countermeasure. So the method used is to compute
these values with a randomized bit-per-bit calculation. This computation is slow
but is done only once at the beginning of each DES and guarantees high-level
security. In the worst case, the attacker will learn only the Hamming weight of
the mask.



Put into Practice We have implemented this countermeasure twice. The first
implementation was done entirely in C using a 32-bit RISC CPU and the second
was done in assembly code using another 32-bit RISC CPU with specialized
assembly instructions to facilitate a DES implementation.

For the first implementation we obtain the following results: 



Type of DES     Timing at 5 MHz   ROM (bytes)   RAM (bytes)
Normal DES      9.4 ms            1540          42
DES with CM1    18.6 ms           2660          187
DES with CM2    21.2 ms           2656          452

Fig. 2. Timings and memory space used for non-optimized C code.



And for the second implementation we obtain: 





An Implementation of DES and AES, Secure against Some Attacks 






Type of DES     Timing at 5 MHz   ROM (bytes)   RAM (bytes)
Normal DES      46.2 µs           596           16
DES with CM2    237.6 µs          2017          272

Fig. 3. Timings and memory space used for assembly code.



Where: 

— Normal DES : a non-optimized implementation without countermeasure. 
This DES served as a basis for the construction of the “secure” implemen- 
tations, 

— CM1 : classical countermeasure (cf. [1]) where the message or its complement
is ciphered,

— CM2 : DES with the masking countermeasure on the message and with 
randomization. 

3.2 AES 

For the AES algorithm, the method is close to the one used for the DES when 
we want to secure the affine and linear parts of the algorithm: we simply keep 
the same mask at each round. Next, we show how to deal with the non-linear 
parts. 

The first part of an AES round is the ByteSub transformation which is the
only non-linear part of the AES. It is an S-Box which is the composition of two
transformations (a multiplicative inversion in $GF(2^8)$ and an affine
transformation $f$) applied on each byte of the input A:




Fig. 4. The ByteSub transformation. 



With the masking countermeasure, we want to obtain the following scheme
where $X_{i,j}$ is the 8-bit value which masks $A_{i,j}$ and
$X1_{i,j} = f(X_{i,j}) \oplus \texttt{0x63}$ (this comes from the affine property
of $f$):
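The affine property used here can be checked with a short Python sketch. It
relies on the standard description of the AES affine transformation as a XOR of
four left rotations plus the constant 0x63; the helper names `rotl8`,
`f_linear` and `f` are illustrative only.

```python
def rotl8(a, k):
    """Rotate an 8-bit value left by k positions (1 <= k <= 7)."""
    return ((a << k) | (a >> (8 - k))) & 0xFF


def f_linear(a):
    """Linear part of the ByteSub affine transformation."""
    return a ^ rotl8(a, 1) ^ rotl8(a, 2) ^ rotl8(a, 3) ^ rotl8(a, 4)


def f(a):
    """Affine transformation of ByteSub: linear part plus the constant 0x63."""
    return f_linear(a) ^ 0x63


def output_mask(x):
    """Mask carried by f(A ^ X): f(X) ^ 0x63, i.e. the linear part applied to X."""
    return f(x) ^ 0x63
```

Since $f(A \oplus X) = f(A) \oplus f(X) \oplus \texttt{0x63}$, applying $f$ to a
masked byte yields $f(A)$ masked by `output_mask(X)`.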




Fig. 5. The ByteSub transformation with masking countermeasure. 

We must resolve the following problem: how to obtain
$A_{i,j}^{-1} \oplus X_{i,j}$ when we have $A_{i,j} \oplus X_{i,j}$ without
compromising the 8-bit value $A_{i,j}$.

Fig. 6. Modified inversion in $GF(2^8)$ with masking countermeasure.



The first idea is, as in the DES algorithm, to use a modified S-Box computed
each time we start an AES. However, in the AES case, the size of the table goes
from 256 bytes to (256 * 16) bytes, i.e., 4 KB, when we choose a 128-bit
message. This solution is not possible when working in a smart-card environment
(this table is dynamic and must be located in RAM).

The approach selected is the following: an operation compatible with inversion
is multiplication, so we obtain the trivial formula:

$$(A_{i,j} \otimes Y_{i,j})^{-1} = A_{i,j}^{-1} \otimes Y_{i,j}^{-1}.$$

The problem lies in transforming a boolean mask into a multiplicative mask. If we
denote by $Y_{i,j}$ a non-zero 8-bit random value and by $\otimes$ the
multiplication in $GF(2^8)$ using the irreducible polynomial
$m(x) = x^8 + x^4 + x^3 + x + 1$ as modulus, the mask transformation is obtained
as follows:
During all stages of the transformation boolean mask / multiplicative mask,
intermediary values are independent of $A_{i,j}$:

1. we multiply with a non-zero 8-bit random $Y_{i,j}$,

2. and we XOR with $X_{i,j} \otimes Y_{i,j}$.

After the inversion in $GF(2^8)$ we have a multiplicative mask and to reestablish
the boolean mask we use values independent of $A_{i,j}$:

1. we XOR with $X_{i,j} \otimes Y_{i,j}^{-1}$,

2. and we multiply with $Y_{i,j}$.
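The four steps above, together with the inversion in between, can be traced in a
small Python sketch. It is a functional check of the algebra only and makes no
claim about side-channel behaviour; the helper names (`gf_mul`, `gf_inv`,
`masked_inverse`) are hypothetical, and the inverse of 0 is taken to be 0, as
usual for the AES.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def gf_inv(a):
    """Inverse in GF(2^8) computed as a^254; maps 0 to 0."""
    result, base, e = 1, a, 254
    while e:
        if e & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        e >>= 1
    return result


def masked_inverse(masked, x, y):
    """From A ^ X compute A^-1 ^ X, using a non-zero random byte y."""
    t = gf_mul(masked, y)            # (A ^ X) * Y
    t ^= gf_mul(x, y)                # A * Y        (multiplicative mask)
    t = gf_inv(t)                    # A^-1 * Y^-1
    t ^= gf_mul(x, gf_inv(y))        # (A^-1 ^ X) * Y^-1
    return gf_mul(t, y)              # A^-1 ^ X     (boolean mask restored)
```

The unmasked byte $A_{i,j}$ is never formed: every intermediate value carries
either the boolean mask, the multiplicative mask, or both.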

Now, let us see the difference between a round of the AES without counter- 
measure and a round with masking countermeasure (fig. 7). 




Fig. 7. The round i of the AES without and with masking countermeasure. 



Where: 
— X represents the mask applied;

— $X1 = f_l(X)$ where $f_l$ is the linear part of the affine transformation $f$ of
ByteSub;

— X2 = ShiftRow(X1);

— X3 = MixColumn(X2);

— $K_i$ represents the round key $i$.

With this, it is possible to compute an AES and to keep the same random 
mask at each round. 



Put into practice. The following timings come from the encryption of a 128-bit
message using a 128-bit key. The implementation was done in assembly code
using an 8-bit CPU.



Type of AES     Timing at 5 MHz   ROM (bytes)   RAM (bytes)
Normal AES      18.1 ms           730           41
AES with CM2    58.7 ms           1752          121

Fig. 8. Timings and memory space used for assembly code.



Where: 

— Normal AES : a non-optimized implementation without countermeasure, 

— AES with CM2 : AES with masking countermeasure. 

4 Some Security Considerations 

4.1 Against SPA 

All the permutations have been randomized so (presumably) the only thing 
the attacker can read is the Hamming weight (HW) of the permuted value. 
Moreover the values are masked meaning that an attacker would need to know 
the value of the mask too. 

During the key scheduling of DES, the attacker could get the HW of the round 
keys Ki, construct linear equations on the 16 rounds and acquire information 
about the key. To avoid such problems the bits of the entire key (56 bits or 
even 64) are processed at each round. Therefore the only information that the 
attacker could obtain is the HW of the key. 

During the other operations (XOR, load, store), we think that the 32-bit 
operations prevent an attacker from being able to get the precise 32-bit value 
that is loaded in one operation. But due to high-order DPA (HODPA) it would 
be better to randomly load, store and XOR these values bit-per-bit. 

Of course, at the beginning of the DES while computing XI and X2, one 
should be careful and use randomization too. It is still true for the AES; moreover 
the operations involved in the AES are well adapted to usual processors and the 
problem of the bit-permutation does not exist here. 
4.2 Against DPA

We will not present the general DPA attack but simply consider the following
fact: an ordinary DPA attack is based on the prediction of one intermediate value
of the computation during the algorithm. Based on this fact, it seems that our
implementation is fully protected against such a simple attack. Indeed, from
the beginning of the DES (input message) to the end of it (cipher text) none of
the “real” intermediate values appear.

Due to the general random mask, we are not vulnerable to the kind of attacks
described in either [4] or [1]. Indeed, at each computation, unless one knows
the mask used, the output of the non-linear part (S-Box for DES / ByteSub
transformation for the AES) is random and not just masked by 0x00 or 0xFF.

Fundamentally we are subject to second-order DPA (see [8,9]) due to the
method of masking. But if we consider a real high-order DPA, some other aspects
appear: due to the randomization of all operations, the place, (i,j) for example,
showing correlation in a HODPA attack, changes a lot. Indeed, in the general
case the values i and j will both have 32 possibilities (if we consider that the
bit in question is in a 32-bit value). Therefore this gives 32 x 32 = 1024 possible
positions for the DPA peak and considerably increases the difficulty of a “normal”
HODPA.



5 Conclusion 

We have described some new ideas for a practical implementation of DES and 
AES: adapted mask, modified S-Box, transformation boolean mask / multiplica- 
tive mask. As is seen from the timings of our implementations, these countermea- 
sures against SPA and DPA can be implemented in a smart-card environment 
where the memory space is restricted and the processor speed is slow. 
Abstract. A standard model of stream cipher combines the outputs of
several independent Linear Feedback Shift Register (LFSR) sequences 
using a nonlinear Boolean function to produce the key stream. Here we 
present a low cost hardware architecture for such secret-key cryptosys- 
tems using a relatively large number of LFSRs. We propose implemen- 
tation of the LFSRs using Cellular Automata in VLSI. This provides a 
regular and uniform two dimensional array of flip flops with only local 
interconnections. The main bottleneck in the implementation of stream 
ciphers using a relatively large number of LFSRs is the implementation 
of the combining Boolean function. We show that this bottleneck can be 
removed and it is feasible to implement “large” cryptographically secure 
Boolean functions using a reconfigurable pipelined architecture. 
Keywords: Stream Ciphers, Boolean functions, Linear Feedback Shift
Registers, Cellular Automata, Reconfigurable Hardware, Pipelined
Architecture.



1 Introduction 

In the most common model of stream ciphers, the outputs of several independent
Linear Feedback Shift Registers (LFSRs) are combined using a nonlinear Boolean
function (see Figure 1a). The initial conditions of the LFSRs constitute the secret
key of the system. In Figure 1b we provide an example of an LFSR. Here the
recurrence relation is $b_n = b_{n-2} \oplus b_{n-5} \oplus b_{n-6}$. The initial
condition in the LFSR is $b_5 b_4 b_3 b_2 b_1 b_0$. After the first step, the output
of the system is the bit $b_0$ and the new bit is
$b_6 = b_4 \oplus b_1 \oplus b_0$. See [3] for more details about LFSRs. In such a
system, $n$ bits from the $n$ different LFSRs are generated at each clock. These
$n$ bits are provided as $n$ input values to the combining function. That is, the
LFSRs provide the input bit streams $X_1, X_2, \ldots, X_n$ to the combining
Boolean function $f$. The output of the combining function is the key stream (K)
which is XORed with the message stream (M) to obtain the cipher stream (C).
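The example LFSR can be simulated with a few lines of Python. The sketch below
only covers the single register of Figure 1b, not the combining function; the
function name `lfsr_bits` and the list-based state layout are illustrative
choices.

```python
def lfsr_bits(initial, count):
    """Output `count` bits of the LFSR with b_n = b_{n-2} ^ b_{n-5} ^ b_{n-6}.

    `initial` lists the six starting bits as (b0, b1, b2, b3, b4, b5).
    """
    state = list(initial)
    out = []
    for _ in range(count):
        out.append(state[0])                       # the oldest bit is output
        new_bit = state[4] ^ state[1] ^ state[0]   # e.g. b6 = b4 ^ b1 ^ b0
        state = state[1:] + [new_bit]              # shift and insert the new bit
    return out


key_stream = lfsr_bits((1, 0, 0, 1, 1, 0), 40)
```

The first six output bits are the initial condition itself; every later bit is
determined by the recurrence.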

The combining Boolean functions must possess certain cryptographic
properties for the overall system to be secure. The design of proper Boolean
functions has received a lot of attention in recent times, as evidenced by the
papers [4,8,10,12]. This has answered many theoretical questions on the design of
Boolean functions for stream cipher applications. It is now time to turn to the
implementation issues of such Boolean functions and their actual use in stream
cipher cryptography.

(a) LFSR based encryption scheme (b) LFSR: One step evolution
Fig. 1. Stream Cipher System

LFSR based stream cipher systems are usually implemented using a Boolean
function on a small number of variables, typically 8 to 10. The main reason is
the difficulty in efficiently implementing a Boolean function on a large number of
variables (say 20 or more variables). However, if one were to use such a function
with properly selected parameters, then none of the currently known attacks
would have even a remote chance of success.

The VLSI area used in implementing stream cipher systems has two
components.

1. The area used to implement the LFSRs. 

2. The area used to implement the Boolean function. 

Suppose the system uses an $n$-input Boolean function and (for simplicity) assume
that all the LFSRs have the same length (say $L$). Then the area used to implement
the LFSRs is proportional to $L \times n$ while the area used to implement the
Boolean function can be proportional to $2^n$. Consequently, while the area
required by the LFSRs increases linearly with $n$, the area required by the Boolean
function can be exponential in $n$. Thus, when the number of inputs to the Boolean
function is increased, the main hurdle is implementing the Boolean function and
not the LFSRs.

Let us compute some real parameters to get a feel for the problem. Suppose
a 32-variable combining function is used where the LFSRs are 64 bits long on
average (shortly we will discuss why we will not use equal length
LFSRs). Then the number of flip-flops required to implement the LFSRs is
only 2048, while a direct implementation of the Boolean function can require
area proportional to $2^{32}$. The key size of such an LFSR system is estimated as
follows. The secret key of the system consists of the initial states of the LFSRs
and hence accounts for 32 x 64 = 2048 key bits. While this is a large key, it
should be noted that currently RSA systems are also being advocated with 2048 bit
keys.

In this paper we tackle the implementation issue of Boolean functions on
a large number of variables. (Here we consider a Boolean function on 24 or
more variables to be a “large” one.) There is no general purpose implementation
method and implementation is dependent on the specific design of the Boolean
function. We present an algorithm and hardware description of the recursive
construction method presented in [4]. The functions in [4] are built recursively.
Thus a function $F$ of $n$ variables is built up from a function $h$ of $k$
variables. It is important to note that, if we use a function $h$ which is optimum
with respect to the parameters algebraic degree, order of resiliency and
nonlinearity, then the function $F$ is also optimum with respect to these
parameters [9]. We describe an algorithm which uses the function $h$ as a black box
(an oracle) and computes the output of $F$ on an $n$-bit input in time linear in
$n - k$. The space required by the algorithm is $O(1)$ plus the space required to
implement $h$.

In an LFSR based stream cipher system, an $n$-bit input is provided to the
function at each clock cycle. Thus our algorithm cannot be directly translated
into a hardware circuit. Instead we use a regular pipelined architecture to map
the algorithm to hardware. The pipeline takes $n - k$ cycles to fill up and after
that, it can handle an $n$-bit input at each clock cycle. There are $n - k$ stages
to the pipeline which are all similar to each other, providing a uniform design.
Implementation of each stage can be done by a circuit or look-up table of constant
size. The total space required to implement $F$ is the space required to implement
$h$ plus an additional $O(n - k)$ size circuit. Usually the number of variables $k$
of the function $h$ will be significantly less than the number of variables $n$ of
the function $F$, and in our system the space required to implement $F$ is of the
same order as the space required to implement $h$. This makes it feasible to
implement functions of 24 or more variables with nominal cost.

An important parameter is the linear complexity of the generated key
sequence. To obtain the maximum possible linear complexity of the key sequence,
we need to use LFSRs whose lengths are coprime to each other [1]. Thus the
LFSRs are going to be of different lengths. A direct implementation of such
different length LFSRs is going to produce a very irregular VLSI structure. To
obtain a more regular structure, we suggest the use of a uniform two dimensional
array of flip flops connected in a suitable fashion. Some of the flip flops in the
two dimensional array will not be functional. This is the price to pay for
obtaining uniformity in the design.

Now consider implementing the different length LFSRs on this two
dimensional structure. Each LFSR must have a large number of tap cells to resist
cryptanalytic attacks [15]. Further, each of the LFSRs is going to have a long
feedback connection. Thus the overall connection pattern on the two dimensional
array is going to be highly irregular. This is also considered to be a disadvantage
in VLSI implementation.

Here we suggest the use of cellular automata (CA) to replace the LFSRs.
The class of CA we suggest is algebraically equivalent to LFSRs. Hence the
security of the system is not affected by this change. The advantage is
that a CA based design provides a uniform and regular structure with only
local interconnections, which is very attractive from a VLSI point of view. CA
based architectures have been proposed for many traditional LFSR applications
(see [2]).

Fig. 2. A (90, 150, 90, 150) CA

In Section 2, we briefly outline the necessary details of CA required to replace 
LFSRs in stream cipher cryptography. The cryptographically useful Boolean 
functions from [4,9,6] are described in Section 3. We summarize the main points 
and gloss over the cryptographic properties since our purpose here is to discuss 
the implementation of these functions. The actual algorithm and hardware de- 
scription is presented in Section 4. Finally we conclude with some remarks on 
future work in Section 5. 

2 Cellular Automata 

A cellular automaton is a finite array of cells, where each cell can store a bit
of information. The collection of values of the cells constitutes the global state
of the CA, whereas the state of a cell is called its local state. The CA evolves
globally in discrete time steps, with the state of each cell changing at each time
step. The change is affected by the values of the two neighbouring cells and also
optionally itself. This is pictorially depicted in Figure 2. The cell at the left end
does not have a left neighbour and one at the right end does not have a right 
neighbour. If the next state of a cell depends on its two neighbours and itself, 
then the cell is said to follow rule 150. If the next state of the cell depends only 
on its two neighbours and not on itself then it is said to follow rule 90. (See [16] 
for an explanation and nomenclature of CA rules) . A CA having cells which use 
only rules 90 and 150 is called a 90/150 CA. In the rest of the discussion we 
will be interested in only 90/150 CA. The next state evolution of a CA can be 
totally described by a tridiagonal matrix as follows. Consider a 4-cell CA with 
rules (90, 150,90, 150) (see Figure 2). If the current state is {xo,Xi,X 2 ,xs), then 
the next state (yo, 2/i, 2 / 2 , ya) is given by 

0 10 0 
1110 
0 10 1 
0 0 11 



(xo,Xi,X2,X3) 



( 2 / 0 , 2 / 1 , 2 / 2 , 2 / 3 )- 
Fig. 3. STD for the CA in Figure 2.



Thus starting from an initial configuration, the CA evolves in discrete time
steps under the action of the state transition matrix. See Figure 3 for the next
state behaviour of the (90, 150, 90, 150) CA. The initial configuration is loaded
in parallel into the CA cells. In our setup this initial configuration is the secret
key for the CA being used. The output of the CA can be taken as the output
of any particular cell of the CA. The sequence generated by any cell is the same as
that of any other cell except for a circular shift. Note that, unlike
LFSRs, the amount of shift between two consecutive cells may be more than 1.

The tridiagonal matrix which governs the behaviour of the CA is called the
state transition matrix. It is known [2] that if the characteristic polynomial of this
matrix is primitive over GF(2) then an n-cell CA will cycle through all the
$2^n - 1$ possible non-null states. The characteristic polynomial of the (90, 150, 90, 150)
CA is $x^4 + x + 1$, which is primitive over GF(2), and hence the CA cycles through
all the non-zero states as shown in Figure 3. The output sequence of the CA is
completely determined by the characteristic polynomial of the state transition
matrix. This is the basis for replacing LFSRs by CA.
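As a quick sanity check of the matrix description, the following Python sketch steps a null-boundary 90/150 CA directly from its rule list and confirms that the (90, 150, 90, 150) CA returns to its starting state only after 15 steps. The function names `ca_step` and `cycle_length` are illustrative and do not come from the paper; cells are indexed from left to right.

```python
def ca_step(state, rules):
    """One synchronous update of a null-boundary 90/150 CA over GF(2).

    state: list of bits; rules: list of 90/150 values of the same length.
    """
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0         # a missing neighbour reads as 0
        right = state[i + 1] if i < n - 1 else 0
        own = state[i] if rules[i] == 150 else 0    # rule 150 also uses the cell itself
        nxt.append(left ^ own ^ right)
    return nxt


def cycle_length(start, rules):
    """Number of steps until the CA first returns to its starting state."""
    state, steps = ca_step(start, rules), 1
    while state != start:
        state, steps = ca_step(state, rules), steps + 1
    return steps


RULES = [90, 150, 90, 150]
print(ca_step([1, 0, 0, 0], RULES))       # [0, 1, 0, 0]
print(cycle_length([1, 0, 0, 0], RULES))  # 15
```

Since a 4-cell CA has exactly 15 non-zero states, a cycle of length 15 means every non-zero state is visited.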

Given a primitive polynomial it is easy to design an LFSR which has this 
polynomial as its connection polynomial. On the other hand the design method 
for CA is not straightforward. One approach is to form the companion matrix
and then use the Lanczos tridiagonalization over GF(2). This approach has been
carried out in [11]. However, a simpler and more elegant algorithm has been
presented by Tezuka and Fushimi [13]. The matter that interests us here is the 
fact that given any primitive polynomial it is possible to design a CA whose state 
transition matrix has this primitive polynomial as its characteristic polynomial. 
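The synthesis direction (polynomial to CA) needs the algorithm of [13], which is not reproduced here. The opposite direction is easy to check, and the sketch below does only that: it computes the characteristic polynomial of a given 90/150 CA using the standard determinant recurrence for tridiagonal matrices, $p_i = (x + d_i)\,p_{i-1} + p_{i-2}$ over GF(2), where $d_i = 1$ for rule 150 and $d_i = 0$ for rule 90. Representing polynomials as Python integers (bit j is the coefficient of $x^j$) is an assumption of this sketch, as are the function names.

```python
def charpoly_90_150(rules):
    """Characteristic polynomial over GF(2) of a null-boundary 90/150 CA.

    Polynomials are Python ints: bit j is the coefficient of x^j.
    """
    prev, cur = 0, 1                        # p_{-1} = 0, p_0 = 1
    for rule in rules:
        d = 1 if rule == 150 else 0
        nxt = (cur << 1) ^ (cur if d else 0) ^ prev   # (x + d) * cur + prev
        prev, cur = cur, nxt
    return cur


def poly_str(p):
    """Human-readable form of a GF(2) polynomial stored as an int."""
    terms = []
    for j in range(p.bit_length() - 1, -1, -1):
        if (p >> j) & 1:
            terms.append("1" if j == 0 else "x" if j == 1 else f"x^{j}")
    return " + ".join(terms)


print(poly_str(charpoly_90_150([90, 150, 90, 150])))   # x^4 + x + 1
```

A candidate rule vector produced by any synthesis method can be checked this way against the target primitive polynomial.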

Following the above discussion it is clear that the use of CA does not alter 
the stream cipher system in any essential way and hence the security of the 
system remains unaltered. The only advantage to be gained is simplicity
of VLSI implementation. CA have been suggested in place of LFSRs because of
several advantages in VLSI design. The local connection structure of a CA makes
it a regular and cascadable architecture. On the other hand, the long feedback
connection of an LFSR introduces delays and is also undesirable from a VLSI layout
point of view (see [2]). See also [7] for a survey on CA.
2.1 CA Based Implementation

As mentioned in Section 1, the main difficulties in the implementation of the
LFSRs are the following.

1. To have the maximum linear complexity, the LFSR lengths need to be pairwise
relatively prime. A direct implementation would have to use registers
of different lengths, resulting in a non-uniform structure.

2. The connection pattern for an LFSR is highly irregular. The tap points of
an LFSR are in general not regularly placed. In addition, the number of
taps in the LFSR must be high to resist certain types of attacks [15].
Further, an LFSR has a long feedback connection, and the length of this
feedback connection can be equal to the length of the LFSR.

Thus a direct implementation of the LFSRs leads to an irregular and non 
uniform design. This is considered to be a distinct disadvantage in VLSI imple- 
mentation. We discuss how the above problems can be tackled. 

To tackle the first problem, we suggest the use of a two dimensional array
of $n \times L$ flip flops, where n is the number of inputs to the Boolean function and L
is the maximum degree of a connection polynomial (say 128). In each row of this
structure, the connection pattern for a single polynomial is implemented. Thus 
in each row, some of the flip flops will not be functional. The cost incurred due to 
this would be offset by the design advantage in using a uniform two dimensional 
array. 

The solution to the second problem is to use 90/150 CA to implement the 
LFSRs. Corresponding to a primitive polynomial we will be able to get the 
corresponding 90/150 CA using the algorithm provided in [13]. Each cell in such 
a CA will be connected to its left and right neighbours. Further if the rule for 
the cell is 150, then it will also be connected to itself. Thus all connections are 
local and regular. Also the long feedback connection of the LFSR is eliminated. 

In the two dimensional array of flip flops, for each row, the number of flip 
flops used is equal to the length of the corresponding CA. The outputs of all 
the CA are taken in a bit slice manner from one end (say the right end) of the 
two dimensional array. In this case the non functional flip flops will be towards 
the left end. Thus the overall design is a two dimensional array of flip flops with 
only local connections and the output is provided in a bit slice manner by the 
rightmost column of the two dimensional array. Such a structure will be simple 
to implement in VLSI and will also provide easy reconfigurability using standard 
structures like FPGA. 

3 Cryptographically Useful Boolean Functions 

We present a brief overview of the various cryptographic properties that a 
Boolean function must satisfy in order to be used for stream cipher systems. 
Since our purpose in this paper is implementation, we briefly mention the prop- 
erties. For more detailed definitions we refer to [4,8]. 
An n-variable function is said to be balanced if the output column of its
truth table has as many zeros as ones. It is said to be m-resilient if
the probability of the output being one is half even if at most m of the inputs
are fixed to constant values. The algebraic normal form of a Boolean function is
its canonical sum of products representation as a multivariate polynomial over
GF(2). The degree of the polynomial is called the algebraic degree or simply
degree of the function. Functions of degree at most one are called affine functions.
Given a Boolean function, its nonlinearity is its Hamming distance to the set of
affine functions, i.e., its Hamming distance to its best affine approximation.
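For small n these properties can be computed directly from a truth table. The sketch below is illustrative only: it relies on the standard Walsh-spectrum characterisations (nonlinearity equals $2^{n-1}$ minus half the largest absolute Walsh value, and a function is m-resilient exactly when its Walsh transform vanishes at every point of Hamming weight at most m), which the text above does not state. The function names and the truth-table indexing are assumptions of the sketch.

```python
def walsh(tt):
    """Walsh-Hadamard spectrum of a truth table (list of 0/1 of length 2^n)."""
    w = [1 - 2 * bit for bit in tt]           # map 0 -> +1, 1 -> -1
    step = 1
    while step < len(w):
        for start in range(0, len(w), 2 * step):
            for j in range(start, start + step):
                w[j], w[j + step] = w[j] + w[j + step], w[j] - w[j + step]
        step *= 2
    return w


def nonlinearity(tt):
    """Hamming distance from the function to the nearest affine function."""
    return len(tt) // 2 - max(abs(v) for v in walsh(tt)) // 2


def resiliency(tt):
    """Largest m such that the function is m-resilient, or -1 if unbalanced."""
    w = walsh(tt)
    n = len(tt).bit_length() - 1
    m = -1
    for weight in range(n + 1):
        if any(w[u] for u in range(len(w)) if bin(u).count("1") == weight):
            break
        m = weight
    return m


# f(x1, x2, x3) = x1*x2 + x3, with x1 the least significant bit of the index
f = [((x & 1) & ((x >> 1) & 1)) ^ ((x >> 2) & 1) for x in range(8)]
print(nonlinearity(f), resiliency(f))          # 2 0

# g = x1 + x2 + x3 is affine, so its nonlinearity is 0, but it is 2-resilient
g = [bin(x).count("1") & 1 for x in range(8)]
print(nonlinearity(g), resiliency(g))          # 0 2
```

The two examples illustrate the tension between the criteria: the affine function has the highest resiliency but no nonlinearity.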

A method of designing cryptographically useful Boolean functions is to start 
from an initial good function and recursively build up the desired function. Sev- 
eral such recursive methods have been proposed [4,12]. The method proposed 
in [4] is simple, though it does not always result in the best function. The reason
is the use there of an unbalanced, highly nonlinear initial function which was not
optimized with respect to nonlinearity, algebraic degree and order of resiliency.
However, for a suitable initial function, the method of [4] produces optimized
functions. These initial functions have to belong to one of the saturated sequences
discussed in [9]. For example, if we use a 7-variable, 2-resilient, degree 4,
nonlinearity 56 function [6], then the resulting sequence of functions constructed using
the method of [4] is the best possible with respect to nonlinearity, algebraic
degree and order of resiliency. Thus we restrict ourselves to implementation of the
recursive method of [4], noting that the initial function h must be an optimized 
function with respect to nonlinearity, algebraic degree and order of resiliency. 

Suppose an n-variable function $F(X_n, \ldots, X_1)$ is to be used in the stream
cipher system. Following the method of [4], this F is represented by a sequence
$(h, S_1, \ldots, S_t)$, where h is the initial function of k variables $X_k, \ldots, X_1$ and
the $S_i$'s are the recursive operators used to build up the function F. Each
$S_i \in \{Q, R\} \times \{r, c, rc\}$, where the action of $S_i$ is described as follows. Let $F_0 = h$
and let $F_i$ be the function produced after application of $S_i$. Suppose $S_i = (\Psi_i, \tau_i)$,
where $\Psi_i \in \{Q, R\}$ and $\tau_i \in \{r, c, rc\}$.

If $\Psi_i = Q$, then

\[
\begin{aligned}
F_i(X_{i+k}, \ldots, X_{k+1}, X_k, \ldots, X_1)
  ={}& (1 \oplus X_{i+k})\, F_{i-1}(X_{i+k-1}, \ldots, X_{k+1}, X_k, \ldots, X_1) \\
  &\oplus X_{i+k}\,\bigl(a \oplus F_{i-1}(b \oplus X_{i+k-1}, \ldots, b \oplus X_{k+1}, b \oplus X_k, \ldots, b \oplus X_1)\bigr).
\end{aligned}
\]

If $\Psi_i = R$, then

\[
\begin{aligned}
F_i(X_{i+k}, X_{i+k-1}, \ldots, X_{k+1}, X_k, \ldots, X_1)
  ={}& (1 \oplus X_{i+k-1})\, F_{i-1}(X_{i+k}, X_{i+k-2}, \ldots, X_{k+1}, X_k, \ldots, X_1) \\
  &\oplus X_{i+k-1}\,\bigl(a \oplus F_{i-1}(b \oplus X_{i+k}, b \oplus X_{i+k-2}, \ldots, b \oplus X_{k+1}, b \oplus X_k, \ldots, b \oplus X_1)\bigr).
\end{aligned}
\]

The value of $\tau_i$ determines the values of a and b in the following manner. If $\tau_i = r$,
then $a = 0$, $b = 1$. If $\tau_i = c$, then $a = 1$, $b = 0$, and if $\tau_i = rc$, then $a = b = 1$.

It is important to note that at each step either $\tau_i \in \{r, c\}$ or $\tau_i \in \{rc, c\}$.
This is required to increase the order of resiliency by 1 at each step (see [4]).
The actual set of possible values for $\tau_i$ is determined recursively as follows. If
the order of resiliency of h is even then $\tau_1 \in \{r, c\}$, else $\tau_1 \in \{rc, c\}$. In general,
if the order of resiliency of $F_{i-1}$ is even then $\tau_i \in \{r, c\}$, else $\tau_i \in \{c, rc\}$.

In this paper, we will solely be concerned with the implementation of F
as represented by the sequence $(h, S_1, \ldots, S_t)$. For cryptographic properties we
refer the reader to [4,9]. Note that $n = k + t$, and $F = F_t$. If h has the order
of resiliency $m_1$, then F has the order of resiliency $m = m_1 + t$. The algebraic
degrees of F and h are the same and the nonlinearity of F is $2^t$ times the nonlinearity
of h.
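The effect of the operators can be observed on a toy example. The sketch below is not the construction of [4] in full; it simply applies the definitions of Q and R with the (a, b) values for r, c and rc to a small initial function chosen for illustration, $h = X_1 X_2 \oplus X_3$ (so k = 3, $m_1 = 0$, nonlinearity 2), and measures resiliency and nonlinearity with a brute-force Walsh transform. All function names are assumptions of the sketch, and h is not one of the optimized functions discussed above.

```python
from itertools import product

AB = {"r": (0, 1), "c": (1, 0), "rc": (1, 1)}   # tau -> (a, b)


def extend(f_prev, psi, tau):
    """Build F_i from F_{i-1}; functions take a tuple (X_top, ..., X_1)."""
    a, b = AB[tau]

    def f(x):
        if psi == "Q":
            sel, rest = x[0], x[1:]
        else:                       # R: X_{i+k-1} selects, X_{i+k} is passed down
            sel, rest = x[1], (x[0],) + x[2:]
        if sel == 0:
            return f_prev(rest)
        return a ^ f_prev(tuple(b ^ v for v in rest))

    return f


def truth_table(f, n):
    return [f(x) for x in product((0, 1), repeat=n)]


def walsh(tt):
    w = [1 - 2 * bit for bit in tt]
    step = 1
    while step < len(w):
        for start in range(0, len(w), 2 * step):
            for j in range(start, start + step):
                w[j], w[j + step] = w[j] + w[j + step], w[j] - w[j + step]
        step *= 2
    return w


def nonlinearity(tt):
    return len(tt) // 2 - max(abs(v) for v in walsh(tt)) // 2


def resiliency(tt):
    w, m = walsh(tt), -1
    for weight in range(len(tt).bit_length()):
        if any(w[u] for u in range(len(w)) if bin(u).count("1") == weight):
            break
        m = weight
    return m


h = lambda x: (x[1] & x[2]) ^ x[0]   # h(X3, X2, X1) = X1*X2 + X3
f1 = extend(h, "Q", "r")             # resiliency of h is 0 (even): tau_1 in {r, c}
f2 = extend(f1, "R", "rc")           # resiliency of F_1 is 1 (odd): tau_2 in {rc, c}
for name, f, n in (("h", h, 3), ("F1", f1, 4), ("F2", f2, 5)):
    tt = truth_table(f, n)
    print(name, resiliency(tt), nonlinearity(tt))
# h 0 2
# F1 1 4
# F2 2 8
```

Each step raises the order of resiliency by one and doubles the nonlinearity, in line with $m = m_1 + t$ and the factor $2^t$ stated above.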



4 Boolean Function Implementation 

In this section we provide algorithms and hardware for resilient functions on
a large number of input variables. The algorithm we present needs one step for
initialization and then t steps in a loop to generate the output. For LFSR based
stream ciphers, the LFSRs output one bit at each clock and hence an n-bit
input is presented to the nonlinear combining function at each clock. Thus an
algorithm which takes more than one clock cycle to compute the output of the
Boolean function will introduce delays into the system, leading to a degradation of
performance. We solve this problem by using a pipelined architecture to map the
algorithm to hardware. The pipeline takes t clock cycles to fill up and from then
on provides a bit of output at each clock cycle. The total delay for obtaining
all the key bits is t clock cycles instead of a delay of t clock cycles for each
key bit. Thus the pipeline ensures that there is no effective degradation in the
performance of the system.



4.1 Algorithm 

We present an algorithm to compute the output of a function F on an n-bit
input $(X_n, \ldots, X_{k+1}, X_k, \ldots, X_1)$. The function F is represented by
$(h(X_k, \ldots, X_1), S_1, \ldots, S_{n-k})$, where h is presented as a black box and can be
implemented either by a combinational circuit or by a look up table. The
algorithm requires both time and space linear in m.

Let F be represented by $(h, S_1, \ldots, S_{n-k})$, where h is a function of k variables.
Define $F_0 = h$ and $F_i$ to be the function represented by $(h, S_1, \ldots, S_i)$. Then
$F_{n-k} = F$. We will refer to the recursive definition of $F_i$ provided in Section 3.
First we present an inefficient but obvious algorithm to compute $F_t = F_{n-k} =
F(X_n, \ldots, X_1)$ based on the recursive definition in Section 3.

recCompute($F_i(X_{i+k}, \ldots, X_1)$)

1. if ($i = 0$) return $h(X_k, \ldots, X_1)$;

2. if ($\Psi_i = Q$) { $X = X_{i+k}$; }

3. else { $X = X_{i+k-1}$; $X_{i+k-1} = X_{i+k}$; }

4. if ($X = 0$) return recCompute($F_{i-1}(X_{i+k-1}, \ldots, X_1)$);

5. else

6. if ($\tau_i = c$) return $1 \oplus$ recCompute($F_{i-1}(X_{i+k-1}, \ldots, X_1)$);

7. if ($\tau_i = r$) return recCompute($F_{i-1}(1 \oplus X_{i+k-1}, \ldots, 1 \oplus X_1)$);

8. if ($\tau_i = rc$) return $1 \oplus$ recCompute($F_{i-1}(1 \oplus X_{i+k-1}, \ldots, 1 \oplus X_1)$);

9. end if

end

Steps 2 and 3 of the above algorithm interchange the variables $X_{i+k}$ and
$X_{i+k-1}$ if $\Psi_i = R$. The rest of the algorithm works according to the recursive
definition of $F_i$. Note that the recursive approach is top down, i.e., it starts
processing the variable $X_n$ first and then descends to lower numbered variables.
It is easy to see that the algorithm takes time $O(t)$. However, the stack depth of
the algorithm is also $O(t)$, which is undesirable. Hence we map it to an iterative
algorithm. A few key observations make this possible.

1. There is no need to carry the variables $X_{k-1}, \ldots, X_1$ through the algorithm.
If $\Psi_1 = Q$, then let $Y = X_k$, else $Y = X_{k+1}$. Set $v_0 = h(Y, X_{k-1}, \ldots, X_1)$
and $v_1 = h(1 \oplus Y, 1 \oplus X_{k-1}, \ldots, 1 \oplus X_1)$. Then we will ultimately have to
output $v_i$ or $1 \oplus v_i$, depending on the variables $X_n, \ldots, X_k$.

2. At each recursive call, depending on the value of $\tau_i$ we either complement
the input or the output or both. Thus at each stage it is sufficient to record
whether the input/output of the next evaluation has to be complemented.
This is managed by two bit variables a and b. The variable a records whether
the output needs to be complemented and the variable b records whether the
input needs to be complemented.

Based on these observations, we next present the algorithm computeTD(.),
which converts the recursive algorithm recCompute() to an iterative algorithm.

computeTD($X_n, \ldots, X_1$) {
    if ($\Psi_1 = Q$) then $Y = X_k$;
    if ($\Psi_1 = R$) then $Y = X_{k+1}$;
    $v_0 = h(Y, X_{k-1}, \ldots, X_1)$; $v_1 = h(1 \oplus Y, 1 \oplus X_{k-1}, \ldots, 1 \oplus X_1)$;
    $a = 0$; $b = 0$;
    for $i = t$ downto 1 do {
(1*)    if ($\Psi_i = Q$) then $X = X_{i+k}$;
(2*)    if ($\Psi_i = R$) then { $X = X_{i+k-1}$; $X_{i+k-1} = X_{i+k}$; }
        if ($b \oplus X = 1$) then {
            if ($\tau_i = c$) then $a = a \oplus 1$;
            if ($\tau_i = r$) then $b = b \oplus 1$;
            if ($\tau_i = rc$) then { $a = a \oplus 1$; $b = b \oplus 1$; }
        }
    }
    return $a \oplus v_b$;
}

Based on the previous discussion, we get the following result. 

Theorem 1. The algorithm computeTD($X_n, \ldots, X_1$) correctly computes
$F(X_n, \ldots, X_1)$ in $O(t)$ time.
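To make the relationship between the two algorithms concrete, here is a minimal Python sketch of both, written from the recursive definition in Section 3. The names `rec_compute` and `compute_td`, the tuple ordering $(X_n, \ldots, X_1)$ and the toy function h are assumptions of the sketch. One deliberate difference, which is a choice made here rather than something stated in the text: the arguments of h are read after the loop, so that any interchanges performed for R operators have already taken effect before $v_0$ and $v_1$ are formed.

```python
from itertools import product

AB = {"r": (0, 1), "c": (1, 0), "rc": (1, 1)}    # tau -> (a, b)


def rec_compute(h, ops, x):
    """Top-down recursion; x = (X_n, ..., X_1), ops[i-1] = (Psi_i, tau_i)."""
    if not ops:
        return h(x)
    psi, tau = ops[-1]
    a, b = AB[tau]
    if psi == "Q":
        sel, rest = x[0], x[1:]
    else:                                   # R: swap the roles of the top two variables
        sel, rest = x[1], (x[0],) + x[2:]
    if sel == 0:
        return rec_compute(h, ops[:-1], rest)
    return a ^ rec_compute(h, ops[:-1], tuple(b ^ v for v in rest))


def compute_td(h, ops, x):
    """Iterative version: two work bits (a, b) replace the call stack."""
    t = len(ops)
    k = len(x) - t
    X = [None] + list(reversed(x))          # X[j] holds X_j
    a = b = 0
    for i in range(t, 0, -1):
        psi, tau = ops[i - 1]
        if psi == "Q":
            sel = X[i + k]
        else:
            sel = X[i + k - 1]
            X[i + k - 1] = X[i + k]         # interchange for the next step
        if b ^ sel:
            da, db = AB[tau]
            a, b = a ^ da, b ^ db
    low = tuple(X[j] for j in range(k, 0, -1))
    v = (h(low), h(tuple(1 ^ bit for bit in low)))   # v_0 and v_1
    return a ^ v[b]


h = lambda x: (x[1] & x[2]) ^ x[0]          # toy initial function on k = 3 variables
ops = [("R", "r"), ("R", "rc"), ("Q", "c")]
agree = all(rec_compute(h, ops, x) == compute_td(h, ops, x)
            for x in product((0, 1), repeat=6))
print(agree)                                 # True
```

The iterative version keeps only a and b between steps, which is what allows each loop iteration to become one pipeline stage in hardware.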
Fig. 4. Pipelined Implementation of computeTD(.)



4.2 Hardware Implementation of computeTD(.)

Here we show how very low cost pipelined hardware can be developed where the
output of F on successive tuples of n-bit input is available at each clock pulse
after the initial t clocks, i.e., starting from the (t+1)-th clock.

In the hardware description, we will be manipulating $\Psi_i$, $\tau_i$ as binary values.
To do this we need to describe how they will be encoded as bits. If $\Psi_i = Q$, then
this is encoded by putting $\Psi_i = 0$. If $\Psi_i = R$, then this is encoded by putting
$\Psi_i = 1$. If $\tau_i = c$, then this is always coded by putting $\tau_i = 1$. On the other hand
$\tau_i = 0$ codes $\tau_i = r$ or $\tau_i = rc$ according as $i \not\equiv m_1 \bmod 2$ or $i \equiv m_1 \bmod 2$,
where $m_1$ is the order of resiliency of the initial function h (see Section 3).

The pipeline has t stages numbered #1 to #t (see Figure 4). Stage #i stores
the current values of the variables up to $X_{n-i+1}$. The two bits $v_0$ and $v_1$ are present at each
stage along with the two other work bits a and b.

The initial stage (Figure 5a) of the algorithm performs the computation
required to get the values $v_0, v_1$. For this the function h needs to be evaluated
twice. We are assuming that each evaluation of the function h takes one clock
cycle and hence h is implemented either as a look up table or by a small depth
computational circuit.

The intermediate stages of the pipeline perform the task of variable
interchange and updating of the bits a and b (see Figure 5b). The bits $v_0, v_1$ are
carried forward unchanged. If $\Psi_i = R$, the values of $X_{i+k}$ and $X_{i+k-1}$ should be
properly interchanged for the next stage as in lines (1*) and (2*) of the
algorithm. The 2x1 multiplexer ensures that the output is X as required by the
algorithm. If X and b are unequal, then the two & gates are activated;
otherwise a and b are carried forward unchanged to the next stage. If $\tau_i = 0$, then
$\tau_i$ represents r or rc and the input has to be complemented. The & of $(X \oplus b)$
and $\overline{\tau_i}$ ensures this. If $\tau_i = 1$, then $\tau_i$ is c and the output certainly needs to be
complemented. Also if $\tau_i = 0$ but represents rc, then the output also needs to be
complemented. But $\tau_i$ can represent rc only if $i - m_1 \equiv 0 \bmod 2$. The value of
the function const(i) is $(i - m_1 + 1) \bmod 2$, and the combination of the or and
& gate ensures that a is updated as required.
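The gate-level behaviour of an intermediate stage can be written out and compared with the symbolic update used in the algorithm. The sketch below is a reading of the description above, not a transcription of Figure 5b: in particular, feeding the complement of the $\tau_i$ bit into the & gate that updates b is an assumption, as are the function and signal names.

```python
def stage_bits(x_hi, x_lo, a, b, psi_bit, tau_bit, const_bit):
    """Gate-level update of one intermediate pipeline stage.

    x_hi = X_{i+k}, x_lo = X_{i+k-1}; psi_bit: 0 = Q, 1 = R;
    tau_bit: 1 = c, 0 = r or rc; const_bit = (i - m1 + 1) mod 2.
    Returns (value forwarded as X_{i+k-1}, new a, new b).
    """
    sel = x_lo if psi_bit else x_hi                 # 2x1 multiplexer selects X
    fwd = x_hi if psi_bit else x_lo                 # R passes X_{i+k} down instead
    active = sel ^ b                                # X and b unequal
    new_b = b ^ (active & (tau_bit ^ 1))            # complement input for r and rc
    new_a = a ^ (active & (tau_bit | const_bit))    # complement output for c and rc
    return fwd, new_a, new_b


def stage_symbolic(x_hi, x_lo, a, b, psi, tau):
    """Same update expressed with the symbolic operators of the algorithm."""
    sel, fwd = (x_hi, x_lo) if psi == "Q" else (x_lo, x_hi)
    if sel ^ b:
        da, db = {"r": (0, 1), "c": (1, 0), "rc": (1, 1)}[tau]
        a, b = a ^ da, b ^ db
    return fwd, a, b


def decode_tau(tau_bit, i, m1):
    """tau bit 1 means c; bit 0 means rc when i = m1 (mod 2), otherwise r."""
    if tau_bit:
        return "c"
    return "rc" if (i - m1) % 2 == 0 else "r"


# Stage i = 3 with m1 = 1: a tau bit of 0 stands for rc, so both work bits flip.
print(stage_bits(1, 0, 0, 0, psi_bit=0, tau_bit=0, const_bit=(3 - 1 + 1) % 2))
# (0, 1, 1)
```

Under this encoding one bit per stage suffices for $\tau_i$, because the choice between r and rc is fixed by the parity of $i - m_1$.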
The final stage (Figure 5c) is simple. Depending on the value of b, it outputs
either $v_0$ or $v_1$, and a is simply EXORed with the output of the 2x1 multiplexer.

The whole circuit operates as follows. At each clock, stage #i forwards the
values of the variables to the next stage and updates the values of the work bits
a, b for the next stage. The values $v_0$ and $v_1$ are forwarded unchanged. It is
important to understand the need for generation of $v_0, v_1$ at the first stage and
carrying them through all the t stages. We need these two bits only at the end
for the final circuit (Figure 5c). However, the values of $v_0, v_1$ are generated from
the variables $X_1$ to $X_{k+1}$. It is more efficient to carry two bits $v_0, v_1$ through the
t stages instead of carrying the k+1 bits $X_1, \ldots, X_{k+1}$. Since there are t stages,
the whole pipeline takes t clock cycles to be completely filled up. Hence the first
output appears at the (t+1)-th clock and subsequently a bit of output appears at
each clock.

We use both the rising and falling edges of the clock. Each stage has two
buffers, one input and one output (see Figure 6). At the leading edge the
values of the input buffer registers of stage #i are latched to the output buffer
registers of the same stage. The signals $X_k, \ldots, X_{i+k-2}$ and $v_0, v_1$ go directly from
input buffer to output buffer. The other three signals $X_{i+k-1}$, a, b are generated
through the inbuilt combinational circuit (Figure 5b) from $X_{i+k}$, $X_{i+k-1}$, a, b
and $\Psi_i$, $\tau_i$. That is, the stage C in Figure 6 contains the circuit of Figure 5c. At
the falling edge of the clock, the output buffer registers of stage #i are latched to
the input buffer registers of stage #(i+1). The inbuilt combinational circuit
being small enough, it is justified to consider that the delay of the circuit is much
less than the clock width and hence there is no problem in using both the leading
and falling edges of the clock in the hardware. The inbuilt combinational circuit
blocks in this architecture can also be implemented using small lookup tables.

Note that the Boolean function is reconfigurable. If we load a new set
of values for $(\Psi_t, \tau_t), \ldots, (\Psi_1, \tau_1)$, then the function F will change, keeping the
cryptographic parameters the same. This will help in accessing the elements of a large
set of Boolean functions with the minimum possible change, namely changing the pattern
of 2t bit values.

We also address the issue of synchronization at this point. The same kind of
system will be available to both the sender and receiver. Once both sides start
with a specific key, the first output comes after a delay of t clock cycles, i.e.,
starting from the (t+1)-th clock. Now consider the case when the key of the
system is going to be changed. In that case, when the new key is loaded, the
pipeline will contain some data generated from the earlier key. The data coming
from the new key will be operational t clock cycles after it is loaded in the
LFSRs. The same holds at both the sender and receiver ends. Hence, there
is no additional requirement for synchronization in this setup.

4.3 A Specific Example 

Consider the implementation of a function F on 24 variables. We take the initial
function h to be a 16-variable function, with order of resiliency 8, algebraic
degree 7 and nonlinearity $2^{15} - 2^9$ [6]. The function h is optimized with respect
to the parameters considered here. Now we use a pipeline of 8 levels to get
the 24-variable function F with order of resiliency 16, algebraic degree 7, and
nonlinearity $2^{23} - 2^{17}$. These are also a set of best possible parameters. The
user has an option of selecting 2 x 8 = 16 bits for $(\Psi_8, \tau_8), \ldots, (\Psi_1, \tau_1)$, to get
a fairly wide range ($2^{16}$) of choices for F. Further, it is possible to design a
suitable architecture so that the values of these 16 bits can be programmable
and the design can be implemented using an FPGA structure. Thus it is possible
to design a reconfigurable structure which can be programmed to implement
any one of the $2^{16}$ possible 24-variable Boolean functions F. The VLSI area
required to implement the reconfigurable structure is roughly equal to the VLSI
area required to implement the 16-variable initial function h. An overall delay
of only 8 clock cycles is introduced in the system due to the pipeline. Note that
the delay is a constant 8 clock cycles and is independent of the length of the key
stream.

In this system, we will have 24 different LFSRs. We implement the LFSRs
by CA. Depending on the requirement of the total key size, we need to choose
the lengths of the CAs, where the lengths of any two CAs are coprime. Let us
assume that the maximum length of a CA will be less than 128. As a specific
example, consider that the lengths take the following values:
41, 43, 47, 53, 57, 59, 61, 67, 71, 73, 74 = 2 x 37, 77 = 7 x 11, 79, 83, 89, 93 = 3 x 31,
97, 101, 103, 107, 109, 113, 115 = 5 x 23, 119. The sum of these
lengths, i.e., the key size of the system, is 1931. Since the resiliency of F is 16,
the first affine function to which F will have non-zero correlation must be
non-degenerate on 17 variables. Hence, the best possible correlation attack will need
to estimate an equivalent polynomial of length x from the cipher text or the key
sequence. Let x be the sum of the first 17 values in the above sequence. Here
x = 1164, and hence any known attack is infeasible. It is also important to note
that the connection polynomial of the equivalent LFSR is the product of the
connection polynomials of the individual LFSRs.
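The arithmetic in this example is easy to re-check. The short script below is only a verification aid with illustrative variable names: it recomputes the key size and the value of x from the list of lengths, and also reports any pairs of lengths that share a common factor, since the lengths are meant to be pairwise coprime. For the list as printed it reports the pairs (57, 93) and (77, 119), so those entries may be worth re-checking against the original.

```python
from itertools import combinations
from math import gcd

LENGTHS = [41, 43, 47, 53, 57, 59, 61, 67, 71, 73, 74, 77,
           79, 83, 89, 93, 97, 101, 103, 107, 109, 113, 115, 119]

key_size = sum(LENGTHS)                 # total number of CA cells in use
x = sum(sorted(LENGTHS)[:17])           # the 17 shortest registers
shared = [(p, q) for p, q in combinations(LENGTHS, 2) if gcd(p, q) > 1]

print(len(LENGTHS), key_size, x)        # 24 1931 1164
print(shared)                           # [(57, 93), (77, 119)]
```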

The two dimensional array of flip flops required to implement the CAs is going
to be a 24 x 128 array. Thus the total number of flip flops is going to be 3072.
Out of these only 1931 are going to be used. This is a small trade off to obtain
a uniform design. Also note that the same number of flip flops is going to be non
functional irrespective of whether CA or LFSR is used. The use of CA ensures
that the connection structure of the array is going to be uniform.

5 Conclusion 

In this paper we have proposed LFSR systems employing large Boolean func- 
tions. We have described hardware implementation of large Boolean functions 
constructed using the recursive method of [4], with optimized function as an 
initial one [9]. The main point we have tried to make is that LFSR systems 
employing large Boolean functions are feasible to implement in hardware. We 
provide a reconfigurable pipelined architecture for the large Boolean function
and propose the use of cellular automata for a regular VLSI structure of
different length LFSRs. Given the known attacks (see [14] and the references in this
paper) and the current advancement of computer systems, it is improbable
that this kind of system will be vulnerable in the near future. To the best of our
knowledge this is the first effort to consider this problem. Several questions re- 
main as to the best possible implementation and the implementation of Boolean 
functions constructed using other recursive methods. We feel these can be pos- 
sible future research topics. 
Abstract. A high-performance implementation of the International
Data Encryption Algorithm (IDEA) is presented in this paper. The design
was implemented in both bit-parallel and bit-serial architectures
and a comparison of design tradeoffs using various measures is presented.
On a Xilinx Virtex XCV300-6 FPGA, the bit-parallel implementation
delivers an encryption rate of 1166 Mb/sec at an 82 MHz system clock
rate, whereas the bit-serial implementation offers a 600 Mb/sec throughput
at 150 MHz. Both designs are suitable for real-time applications,
such as on-line high-speed networks. The implementation is runtime
reconfigurable such that key-scheduling is done by directly modifying the
bitstream downloaded to the FPGA, hence enabling an implementation
without the logic required for key-scheduling. Both implementations are
scalable such that higher throughput is obtained with increased resource
requirements. The estimated performances of the bit-parallel and bit-serial
implementations on an XCV1000-6 device are 5.25 Gb/sec and
2.40 Gb/sec respectively.

Keywords: Cryptographic hardware, digital-design, reconfigurable-computing,
performance-tradeoffs.



1 Introduction 

Cryptography is concerned with the transfer of information between parties so 
that only the intended parties can read the data. Despite an assumption that 
an adversary may have full knowledge of the algorithms used, and has access 
to the media where data is transmitted, the aim of cryptography is to make it 
intractable to retrieve the data without knowledge of a secret piece of informa- 
tion called a key. Cryptography is an ideal application for custom computing 
machines since they offer the following advantages over VLSI technologies:

— it is possible to use the same Field-Programmable Custom Computing Ma- 
chine (FCCM) hardware for many different cryptographic protocols 



Ç.K. Koç, D. Naccache, and C. Paar (Eds.): CHES 2001, LNCS 2162, pp. 333–347, 2001.
© Springer-Verlag Berlin Heidelberg 2001




334 O.Y.H. Cheung et al. 



— Moore’s law continues to offer improved silicon technology at exponential 
rates which is available to FCCM designers without the costly manufacturing 
process required in VLSI 

— it is possible to specialize the hardware to an extent not possible in VLSI 
devices to improve performance 

— the reconfigurable nature makes it feasible to attempt designs employing 
more sophisticated algorithms which leads to an improvement in perfor- 
mance. 

The Data Encryption Standard (DES) algorithm has been a popular secret 
key encryption algorithm and is used in many commercial and financial appli- 
cations. Although introduced in 1976, it has proved resistant to all forms of 
cryptanalysis. However, its key size is too small by current standards and its 
entire 56-bit key space can be searched in approximately 22 hours [9]. 

In 1990, Lai and Massey introduced an iterated block cipher known as the
Proposed Encryption Standard (PES) [16]. The same authors, joined by Murphy,
proposed a modification of PES called Improved PES (IPES) [17], which
improves the security of the original algorithm against differential analysis and
truncated differentials [13,15,5]. In 1992, IPES was commercialized and was
renamed the International Data Encryption Algorithm (IDEA). Some believe that,
to date, the algorithm is the best and the most secure block algorithm available
to the public [26].

Although IDEA involves only simple 16-bit operations, software implementations
of this algorithm still cannot offer the encryption rate required for on-line
encryption in high-speed networks. Ascom's implementation of IDEA (Ascom are
the holders of the patent on the IDEA algorithm) achieves $0.37 \times 10^6$
encryptions per second, or an equivalent encryption rate of 23.53 Mb/sec, on an Intel
Pentium II 450 MHz machine. An implementation of IDEA using the Intel MMX
multimedia instructions was proposed by Helger [20] and achieves $0.51 \times 10^6$
encryptions per second, or an equivalent encryption rate of 32.9 Mb/sec, on an Intel
Pentium II 233 MHz machine. Our optimized software implementation, running
on a Sun Enterprise E4500 machine with twelve 400 MHz Ultra-II processors,
performs $2.30 \times 10^6$ encryptions per second, or an equivalent encryption rate of
147.13 Mb/sec, and still cannot be applied to applications such as encryption for
155 Mb/sec Asynchronous Transfer Mode (ATM) networks.

Hardware implementations offer significant speed improvements over software
implementations by exploiting parallelism among operators. In addition,
they are likely to be cheaper, with lower power consumption and a smaller
footprint than a high speed software implementation. A paper design of an IDEA
processor which achieves 528 Mb/sec on four XC4020XL devices was proposed
by Mencer et al. [23]. The first VLSI implementation of IDEA was developed
and verified by Bonnenberg et al. in 1992 using a 1.5 μm CMOS technology [4].
This implementation had an encryption rate of 44 Mb/sec. In 1994, VINCI,
a 177 Mb/sec VLSI implementation of the IDEA algorithm in 1.2 μm CMOS
technology, was reported by Curiger et al. [7,31]. A 355 Mb/sec implementation
of IDEA in 0.8 μm technology was reported in 1995 by Wolter et al. [27],
followed by a 424 Mb/sec single chip implementation in 0.7 μm technology by
Salomao et al. [25]. In 2000, Leong et al. proposed a 500 Mb/sec
bit-serial implementation of IDEA on a Xilinx Virtex XCV300-6 FPGA which is
scalable on larger devices [18]. Later, Goldstein et al. reported an implementation
on the PipeRench FPGA which achieves 1013 Mb/sec [11]. A commercial
implementation of IDEA called the IDEACrypt Kernel, developed by Ascom, achieves
720 Mb/sec [3] in 0.25 μm technology. The implementation derived from the
IDEACrypt Kernel, called the IDEACrypt Coprocessor, has a throughput of
300 Mb/sec [2].

In this paper, two Xilinx Virtex Field Programmable Gate Array (FPGA)
based implementations of the IDEA algorithm are described. On an XCV300-6
device, the bit-parallel implementation offers a 1166 Mb/sec encryption rate,
while the bit-serial implementation has a throughput of 600 Mb/sec. The
implementation is scalable so that throughput and area tradeoffs can be addressed.
Applications of these designs include virtual private networks (VPNs) and
embedded encryption/decryption devices. To illustrate various design tradeoffs, an
analysis of both designs in terms of area, latency, throughput and other
design measures was carried out.

Key-scheduling in both implementations is achieved by modifying the
bitstream downloaded to the FPGA, in a manner similar to that described by
Patterson in an implementation of DES [24]. Instead of doing this using the JBits
Applications Programming Interface (API), a technique for the direct modification
of the binary bitstream was used. The approach is advantageous because
dedicated logic for key-scheduling is not required in the designs, hence leaving
more logic resources for performing computation.

This paper is organized as follows. In Section 2 the IDEA algorithm as well as
algorithms for multiplication modulo $2^n + 1$ are described. The bit-parallel and
bit-serial implementations of IDEA are presented in Sections 3 and 4 respectively.
In Section 5 the methodology to achieve runtime reconfigurability is described.
In Section 6 results are given. Conclusions are drawn in Section 7.



2 The IDEA Algorithm 

IDEA belongs to a class of cryptosystems called secret-key cryptosystems, which
are characterized by the symmetry of the encryption and decryption processes, and
the possibility of deriving the decryption key from the encryption key and vice
versa. IDEA takes 64-bit plaintext inputs and produces 64-bit ciphertext outputs
using a 128-bit key.

The design philosophy behind IDEA is to mix operations from different
algebraic groups including XOR, addition modulo $2^{16}$, and multiplication modulo
the Fermat prime $2^{16} + 1$. All these operations work on 16-bit sub-blocks.
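The three group operations are simple to state in software, which helps when checking a hardware implementation against reference values. The following Python sketch is illustrative (the function names are not from the paper); the multiplication uses the usual IDEA convention that the all-zero 16-bit sub-block stands for $2^{16}$.

```python
MASK = 0xFFFF          # 16-bit sub-blocks
MODULUS = 0x10001      # the Fermat prime 2^16 + 1


def xor16(a, b):
    """Bitwise XOR of two 16-bit sub-blocks."""
    return (a ^ b) & MASK


def add16(a, b):
    """Addition modulo 2^16."""
    return (a + b) & MASK


def mul16(a, b):
    """Multiplication modulo 2^16 + 1, with 0 standing for 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % MODULUS) & MASK   # a result of 2^16 is stored as 0


print(hex(add16(0xFFFF, 0x0001)))   # 0x0
print(hex(mul16(0x0000, 0x0001)))   # 0x0     (2^16 * 1 = 2^16, stored as 0)
print(hex(mul16(0x0000, 0x0000)))   # 0x1     (2^16 * 2^16 = 1 mod 2^16 + 1)
print(hex(mul16(0x0000, 0x0002)))   # 0xffff
```

Because $2^{16} + 1$ is prime, the product of two non-zero residues is never zero, so the result always maps back to a valid 16-bit sub-block.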

The IDEA block cipher [26] (depicted in Figure 1) consists of a cascade of
eight identical blocks known as rounds, followed by a half-round or output
transformation. In each round, XOR, addition and modular multiplication
operations are applied. IDEA is believed to possess strong cryptographic
strength because

- its primitive operations are of three distinct algebraic groups of $2^{16}$
elements

- multiplication modulo $2^{16} + 1$ provides desirable statistical
independence between plaintext and ciphertext

- its property of having iterative rounds makes differential attacks difficult.

[Figure 1 legend: bitwise XOR of 16-bit sub-blocks; addition modulo $2^{16}$ of
16-bit integers; multiplication modulo $2^{16} + 1$ of 16-bit integers with the
zero sub-block corresponding to $2^{16}$.]

Fig. 1. Block diagram of the IDEA algorithm.

The encryption process is as follows. The 64-bit plaintext is divided into
four 16-bit plaintext sub-blocks, $X_1$ to $X_4$. The algorithm converts the
plaintext blocks into ciphertext blocks of the same bit-length, similarly
divided into four 16-bit sub-blocks, $Y_1$ to $Y_4$. Fifty-two 16-bit subkeys,
$Z_i^{(r)}$, where $i$ and $r$ are the subkey number and round number
respectively, are computed from the 128-bit secret key. Each round uses six
subkeys and the remaining four subkeys are used in the output transformation.
The decryption process is essentially the same as the encryption process except
that the subkeys are derived using a different algorithm [26].

The algorithm for computing the encryption subkeys (called the key-schedule)
involves only logical rotations. Order the 52 subkeys as
$Z_1^{(1)}, \ldots, Z_6^{(1)}, Z_1^{(2)}, \ldots, Z_6^{(2)}, \ldots, Z_1^{(8)}, \ldots, Z_6^{(8)}, Z_1^{(9)}, \ldots, Z_4^{(9)}$.
The procedure begins by partitioning the 128-bit secret key Z into eight 16-bit
blocks and assigning them directly to the first eight subkeys. Z is then rotated
left by 25 bits, partitioned into eight 16-bit blocks and again assigned to the
next eight subkeys. The process continues until all 52 subkeys are assigned. The
decryption subkeys $Z'^{(r)}_i$ can be computed from the encryption subkeys with
reference to Table 1.
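
The key-schedule just described can be sketched in software as follows. This is a minimal illustration, not the authors' implementation (which stores the schedule in LUTs via the bitstream); the function name `idea_expand_key` and the convention that the key is supplied as eight 16-bit words, most significant word first, are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define IDEA_SUBKEYS 52

/* Expand a 128-bit key, given as eight 16-bit words (most significant
 * word first), into the 52 encryption subkeys. */
void idea_expand_key(const uint16_t key[8], uint16_t subkey[IDEA_SUBKEYS])
{
    uint16_t z[8], next[8];
    int produced = 0;

    for (int i = 0; i < 8; i++)
        z[i] = key[i];

    while (produced < IDEA_SUBKEYS) {
        /* Assign the current eight 16-bit blocks of Z to the next subkeys. */
        for (int i = 0; i < 8 && produced < IDEA_SUBKEYS; i++)
            subkey[produced++] = z[i];

        /* Rotate the 128-bit value Z left by 25 bits: a rotation by one
         * whole word (16 bits) followed by a rotation by 9 bits. */
        for (int i = 0; i < 8; i++)
            next[i] = (uint16_t)((z[(i + 1) % 8] << 9) | (z[(i + 2) % 8] >> 7));
        for (int i = 0; i < 8; i++)
            z[i] = next[i];
    }
}
```

Because only rotations and word selection are involved, the same schedule costs no arithmetic logic in hardware, which is what makes storing precomputed subkeys attractive.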
Table 1. IDEA decryption subkeys $Z'^{(r)}_i$ derived from encryption subkeys
$Z^{(r)}_i$. $-Z_i$ and $Z_i^{-1}$ denote the additive inverse modulo $2^{16}$
and the multiplicative inverse modulo $2^{16} + 1$ of $Z_i$ respectively.
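
The two inverses named in the caption can be computed as sketched below. The function names are hypothetical, and the use of Fermat's little theorem for the multiplicative inverse is an assumption chosen for brevity (an extended Euclidean algorithm would serve equally well); as elsewhere in IDEA, the 16-bit value 0 stands for $2^{16}$.

```c
#include <assert.h>
#include <stdint.h>

/* Additive inverse modulo 2^16. */
uint16_t idea_add_inv(uint16_t z)
{
    return (uint16_t)(0u - z);
}

/* Multiplicative inverse modulo 2^16 + 1, with 0 standing for 2^16.
 * Uses Fermat's little theorem: z^(p-2) = z^(-1) (mod p) for prime p. */
uint16_t idea_mul_inv(uint16_t z)
{
    const uint64_t p = 65537;
    uint64_t base = z ? z : 65536;
    uint64_t result = 1;

    for (uint32_t e = 65535; e != 0; e >>= 1) {   /* e starts at p - 2 */
        if (e & 1)
            result = (result * base) % p;
        base = (base * base) % p;
    }
    return (uint16_t)result;   /* a result of 2^16 wraps to the 0 encoding */
}
```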
In Electronic Codebook (ECB) mode [26], the data dependencies of the IDEA
algorithm have no feedback paths. Additionally, in practice, latencies of the
order of microseconds are acceptable. These features make deeply pipelined
implementations possible.

2.1 Multiplication modulo $2^n + 1$

Of the basic operations used in the IDEA algorithm, multiplication modulo
$2^{16} + 1$ is the most complicated and occupies most of the hardware. Curiger
et al. [8] described and compared several VLSI architectures for multiplication
modulo $2^n + 1$ and found that an architecture proposed by Meier and
Zimmerman [22], using modulo $2^n$ adders with bit-pair recoding, offers the
best performance.

The C code for the multiplication modulo $2^{16} + 1$ operation by modulo
$2^{16}$ adders using bit-pair recoding is as follows.

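 1  uint16 mulmod(uint16 x, uint16 y)
 2  {
 3      uint16 xd, yd, th, tl;
 4      uint32 t;
 5      xd = (x - 1) & 0xFFFF;                /* diminished operands; 0 encodes 2^16 */
 6      yd = (y - 1) & 0xFFFF;
 7      t = (uint32) xd * yd + xd + yd + 1;   /* t = x * y as a 32-bit value */
 8      tl = t & 0xFFFF;                      /* lower word */
 9      th = t >> 16;                         /* upper word */
10      return (tl - th) + (tl <= th);        /* reduce modulo 2^16 + 1 */
11  }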



This algorithm requires a total of six additions and subtractions, one 16-bit
multiplication and one comparison. However, in IDEA one of the operands of a
modular multiplication operation is always a subkey, so the second subtraction
can be eliminated if the associated subkeys are pre-decremented.
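
The pre-decrement optimization can be illustrated with the following sketch, which drops line 6 of the listing and takes the already decremented subkey as input. The names `mulmod_predec`, `mulmod_reference` and `count_mismatches` are hypothetical; the reference routine is a plain 64-bit remainder included only so that the variant can be checked against it.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplication modulo 2^16 + 1 (0 encodes 2^16) where the second operand
 * is supplied pre-decremented: yd = (y - 1) & 0xFFFF. */
uint16_t mulmod_predec(uint16_t x, uint16_t yd)
{
    uint16_t xd = (uint16_t)(x - 1);
    /* Equals x * y; wraps to 0 only when both operands encode 2^16. */
    uint32_t t = (uint32_t)xd * yd + xd + yd + 1;
    uint16_t tl = (uint16_t)(t & 0xFFFF);
    uint16_t th = (uint16_t)(t >> 16);
    return (uint16_t)((tl - th) + (tl <= th));
}

/* Straightforward reference using a 64-bit remainder. */
uint16_t mulmod_reference(uint16_t x, uint16_t y)
{
    uint64_t a = x ? x : 65536, b = y ? y : 65536;
    return (uint16_t)((a * b) % 65537);
}

/* Count disagreements between the two routines over sampled operand pairs. */
unsigned long count_mismatches(unsigned step)
{
    unsigned long bad = 0;
    for (uint32_t x = 0; x < 65536; x += step)
        for (uint32_t y = 0; y < 65536; y += step)
            if (mulmod_predec((uint16_t)x, (uint16_t)(y - 1)) !=
                mulmod_reference((uint16_t)x, (uint16_t)y))
                bad++;
    return bad;
}
```

Since the subkey is fixed for the lifetime of a key, the decrement is paid once when the key-schedule is prepared rather than on every multiplication.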

3 Bit-Parallel Implementation 

3.1 Multiplication modulo $2^{16} + 1$

Modulo multiplication is the bottleneck in the IDEA algorithm. In a single round
of the algorithm there are four modular multiplications, so a well-designed
multiplication modulo $2^{16} + 1$ operator is crucial since it directly affects
the system performance both in terms of area and throughput.

The modular multiplication algorithm described in Section 2.1 was used in
our design, but instead of taking x and y as inputs, the operator takes x and yd
as inputs. As one of the operands is a subkey which is regarded as a constant,
the modification eliminates one subtraction operator by taking advantage of
pre-decremented subkeys (Section 2.1, pseudocode line 6).

To maximize the throughput of the multiplication modulo $2^{16} + 1$ operator,
additional pipeline stages were introduced. In our design, the 16-bit multiplier
used in Section 2.1 (pseudocode line 7) is constructed by the Xilinx CORE
Generator [30] and has a latency of 4 cycles. The complete multiplication modulo
$2^{16} + 1$ operator pipeline has a latency of 7 cycles.



3.2 Bit-Parallel IDEA Core 

The IDEA algorithm is a cascade of eight identical rounds of operations,
followed by an output transformation. By instantiating building blocks, that is,
additions, XORs and modular multiplications, and inserting appropriate stage
latches for time-alignment, a module for one round of computation is formed.
For the best area-efficiency, stage latches are constructed from Virtex SRL16E
primitives [29,10].

Due to limited hardware resources, each round of the algorithm shares the 
same physical resource, but with different key-schedules. Output transformation 
also reuses the resource. In our implementation the key-schedules are stored 
inside ROM primitives. The architecture of the bit-parallel IDEA core is shown 
in Figure 2. 

As mentioned earlier, for ECB mode operations, data dependencies of the
IDEA algorithm have no feedback paths. This property enables the round
architecture to take input values until the pipeline is filled, after which the
output values are redirected to the input of the pipeline. In an IDEA round,
the data passes through three multiplication modulo $2^{16} + 1$ operators, each
of which has a latency of 7 cycles. Thus the full round pipeline has a latency
of 21 cycles. For an output transformation, the data must pass through a single
multiplication modulo $2^{16} + 1$ operator with a pipeline latency of 7 cycles.
Therefore the core has a total latency of $21 \times 8 + 7 = 175$ cycles. The
core takes 21 64-bit plaintexts per $21 \times 9 = 189$ cycles, equivalently
performing encryption at $(21 / 189) \times 64 \times f$ Mb/sec with a system
clock rate of $f$ MHz. For instance, at an 82 MHz clock rate, the core delivers
an encryption rate of 583 Mb/sec with a latency of 2.134 µs.
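
The cycle counts above can be turned into a small performance model. The sketch below only restates the arithmetic of this section; the structure and function names are hypothetical, and the constants (7-cycle multiplier, three multipliers per round, eight rounds plus an output transformation) are those quoted in the text.

```c
#include <assert.h>

typedef struct {
    unsigned latency_cycles;    /* cycles from first input to first output */
    double   throughput_mbps;   /* Mb/sec */
    double   latency_us;        /* microseconds */
} core_perf;

/* Performance of one bit-parallel core at a clock rate of f_mhz MHz. */
core_perf bit_parallel_perf(double f_mhz)
{
    const unsigned mul_latency = 7;     /* modular multiplier pipeline depth */
    const unsigned muls_in_path = 3;    /* multipliers traversed per round */
    const unsigned rounds = 8;
    const unsigned round_latency = mul_latency * muls_in_path;    /* 21 */
    const unsigned passes = rounds + 1; /* 8 rounds + output transformation */
    core_perf p;

    p.latency_cycles = round_latency * rounds + mul_latency;      /* 175 */
    /* round_latency blocks fill the pipeline and occupy it for all passes. */
    p.throughput_mbps =
        (double)round_latency / (round_latency * passes) * 64.0 * f_mhz;
    p.latency_us = p.latency_cycles / f_mhz;
    return p;
}
```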



4 Bit-Serial Implementation 

The bit-serial implementation mentioned below is an improved implementation 
of [18]. By register reordering and register duplication, the improved implementa- 
tion offers an encryption throughput of 600 Mb/sec, 20% faster than the original 
implementation. 

Bit-serial architectures are characterized by the property that operators per- 
form their computations in a bitwise fashion and communications between op- 
erators are multiplexed in time over a single wire. Dataflow begins with either 
the least significant bit or the most significant bit, but the former is more com- 
monly used due to its compatibility with two’s complement arithmetic. In a 
typical bit-serial implementation, each variable is associated with a control
signal which is set high only when the first bit is transferred along the
associated data bus. To reduce area, control signals can be shared among the
variables. Since
bit-serial operators usually require the first bits of their operands to enter the 
operators on the same clock cycle, appropriate stage latches must be inserted 
for time-alignment [12]. 

Two of the primitive operators used in IDEA, namely XOR and addition
modulo $2^{16}$, can be easily implemented bit-serially. These two operators
have latencies of one clock cycle and are capable of taking consecutive
bit-serial operands. The multiplication modulo $2^{16} + 1$ operator has a
latency of 35 clock cycles. As in the parallel implementation, stage latches are
implemented using SRL16E primitives. Constants are also implemented as SRL16E
primitives, each with its output connected to its input to form a cyclic
shift register.
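
As an illustration of why addition modulo $2^{16}$ maps so naturally onto a bit-serial datapath, the following behavioural sketch models a one-bit full adder with a carry flip-flop that is cleared by the control signal accompanying the first (least significant) bit. It is a software model with hypothetical names, not the authors' VHDL, and it does not model the one-cycle output register.

```c
#include <assert.h>
#include <stdint.h>

/* State of a one-bit serial adder: a full adder plus a carry flip-flop. */
typedef struct {
    unsigned carry;
} serial_adder;

/* One clock cycle. `first` models the control signal that is high while the
 * least significant bits enter; it clears the stored carry. */
unsigned serial_adder_clock(serial_adder *s, unsigned a, unsigned b,
                            unsigned first)
{
    unsigned cin = first ? 0u : s->carry;
    unsigned sum = a ^ b ^ cin;
    s->carry = (a & b) | (a & cin) | (b & cin);
    return sum;
}

/* Add two 16-bit words LSB first. The carry out of bit 15 is simply never
 * used, which is exactly addition modulo 2^16. */
uint16_t serial_add16(uint16_t x, uint16_t y)
{
    serial_adder s = { 0 };
    uint16_t result = 0;

    for (unsigned i = 0; i < 16; i++) {
        unsigned bit = serial_adder_clock(&s, (x >> i) & 1u, (y >> i) & 1u,
                                          i == 0);
        result |= (uint16_t)(bit << i);
    }
    return result;
}
```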

4.1 Multiplication modulo $2^{16} + 1$

The modular multiplication algorithm described in Section 2.1 was directly ap- 
plied in the bit-serial implementation of the algorithm. The operator optimiza- 
tion used in the bit-parallel implementation, described in Section 3.1, was not 
applied in the bit-serial implementation because comparisons in bit-serial archi- 
tectures are not efficient in terms of latency. 

An N × N-bit multiplier generates a 2N-bit result, and requires 2N cycles to
complete. Thus, the throughput of bit-serial multipliers is restricted because
the minimum interval between consecutive multiplications must be at least 2N
cycles. In the IDEA algorithm one of the operands of every modular
multiplication is a subkey and is treated as a constant.
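
The 2N-cycle restriction can be seen in a behavioural model of a generic serial-parallel multiplier, sketched below for N = 16. This is a textbook shift-and-add model under assumed names, not the modified Lyon multiplier used in the design: one operand arrives LSB first, the constant operand is held in parallel, and one product bit leaves per cycle.

```c
#include <assert.h>
#include <stdint.h>

/* Behavioural model of a 16 x 16 serial-parallel multiplier. Operand a
 * arrives LSB first, one bit per cycle; b is held in parallel (the constant
 * subkey). Product bits leave LSB first, so all 32 bits take 32 cycles. */
uint32_t serial_parallel_mul16(uint16_t a, uint16_t b)
{
    uint32_t acc = 0;       /* running partial sum held in the multiplier */
    uint32_t product = 0;

    for (unsigned cycle = 0; cycle < 32; cycle++) {
        /* After the 16 operand bits, zeros are shifted in (zero-padding)
         * to flush the upper word out of the accumulator. */
        unsigned a_bit = (cycle < 16) ? ((a >> cycle) & 1u) : 0u;
        if (a_bit)
            acc += b;
        product |= (acc & 1u) << cycle;   /* serial output bit */
        acc >>= 1;
    }
    return product;
}
```

The second half of the loop does no useful input work, which is the idle time that duplicating the multiplier pipeline is meant to recover.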

Recall that in the modular multiplication algorithm the intermediate result t
is divided into two portions (Section 2.1, pseudocode lines 7-9). The two
portions, th and tl, are respectively the upper and lower 16 bits of the
double-word, which are operands to subsequent operations. A design that computes
the upper and lower words of t independently is desirable, allowing all the
inputs, outputs and intermediate variables of the operator to be 16 bits long.
Using this scheme and duplicating hardware, the throughput of a modular
multiplication operation can be doubled.

A modified version of Lyon’s parallel-serial multiplier [21] was developed
which addresses this problem. To generate two 16-bit results in 16 cycles, the
throughput of the multiplier must be doubled. We achieved this by duplicating
the hardware for multiplication, as illustrated in Figure 3. Registers storing
the constant are shared between the two multiplication pipelines. The outputs p
and q correspond to the results of two consecutive multiplications, where the
two 32-bit long variables have a time-difference of 16 cycles. The control
signal, which is high one clock cycle before the least significant bit enters
the module, toggles the control register. The vector of input variables
$a_{n-1} \ldots a_1 a_0$ is consequently redirected into the two multiplication
pipelines alternately. While the vector is being redirected to one pipeline,
logic zero enters the other pipeline, carrying out zero-padding.

To obtain the time-aligned upper and lower words of t, a 16-stage shift
register is required. The input and output of the shift register are the upper
and lower words of t respectively, 16 cycles after t is valid. In the
implementation the shift register is implemented as an SRL16E [29] primitive.
The complete architecture for the modular multiplication operation is shown in
Figure 4. Upon initialization, the subkey associated with the operator is passed
into the operator bit-serially. The pre-decremented subkey is shifted into the
registers of the multiplier, and at the same time stored into the SRL16E
primitive responsible for key storage.
Fig. 3. Parallel-serial multiplier modified for increased throughput.

Fig. 4. Bit-serial architecture for multiplication modulo $2^{16} + 1$ operations.



Utilizing the idea of multiple pipelines, the modular multiplication operator
accepts a new operand every 16 cycles, even though a 32-bit intermediate result
is computed. This scheme doubles the throughput, but since the b registers are
shared, the hardware cost is less than double.

4.2 Bit-Serial IDEA Core 

The core implementation of IDEA is obtained by cascading eight identical rounds
of operations shown in Figure 5, followed by an output transformation. The core
takes one 64-bit plaintext once every 16 cycles, yielding an effective
encryption rate of $f \times 64 \div 16$ Mb/sec at a system clock rate of $f$
MHz. At 150 MHz, for example, the performance of the core is 600 Mb/sec.

Each round has a latency of 109 cycles. The output transformation has a
latency of 35 cycles. Each serial-to-parallel converter at the outputs has a
latency of 16 cycles. Therefore, the IDEA core has an overall latency of
$109 \times 8 + 35 + 16 = 923$ cycles. At a 150 MHz system clock rate, the
equivalent latency is 6.153 µs.
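
As a worked check of the figures quoted for the bit-serial core, with $f$ the clock rate in MHz and using only the cycle counts stated in the text:

```latex
\begin{align*}
\text{throughput} &= \frac{64\ \text{bits}}{16\ \text{cycles}} \times f
                   = 4 \times 150 = 600\ \text{Mb/sec}, \\
\text{latency}    &= 109 \times 8 + 35 + 16 = 923\ \text{cycles}
                   = \frac{923}{150}\ \mu\text{s} \approx 6.153\ \mu\text{s}.
\end{align*}
```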
Fig. 5. Bit-serial architecture for one round of the IDEA algorithm.



5 Bitstream Reconfiguration 



The basic building block of the Virtex FPGA is the logic cell (LC). An LC
includes a 4-input function generator, carry logic and a storage element. Each
Virtex Configurable Logic Block (CLB) contains four LCs, organized in two
slices. The 4-input function generators are implemented as 4-input LUTs. Each of
them can provide the functions of one 4-input LUT or a 16 × 1-bit synchronous
RAM (called “distributed RAM”). Furthermore, two LUTs in a slice can be combined
to create a 16 × 2-bit or 32 × 1-bit synchronous RAM, or a 16 × 1-bit dual-port
synchronous RAM.

The contents of the LUTs upon initialization are encoded in the bitstream.
Xilinx disclosed the format of the bitstream for Virtex series FPGAs [14,6],
hence it is possible to edit the bitstream and alter the contents of LUTs. Our
approach to achieving runtime reconfigurability is to build all configurable
blocks from LUTs and later modify the bitstream.

More specifically, the key-schedule is stored only inside ROM or SRL16E 
primitives which are implemented as LUTs. After technology-mapping, place- 
ment and routing, a circuit description (with a .ncd extension) is generated. 
Using the ncdread tool provided by Xilinx, the contents of the circuit descrip- 
tion can be converted into a human-readable format. It is possible to extract the 
physical location of individual LUTs from the output of ncdread. 

We have developed software to customize bitstreams for different key-schedules.
In the first step (which need only be performed once for a given design),
information concerning the physical location of individual LUTs which are used
in the key schedule is extracted from the .ncd file and written to a location
file (locfile). To modify a bitstream, the LUTs are directly modified by a
program to use a given key-schedule. The pseudocode below describes the
technique that was used.

changekeys(locfile, bitstream)
{
    locdb = read(locfile);
    foreach bit in the key
    {
        Find location of the LUT using locdb;
        Modify the value of the bit in the LUT;
    }
    Recompute CRC for the bitstream;
    Write bitstream;
}

On an Intel Pentium III 866 MHz machine, the reconfiguration process re- 
quires the modification of 6 x 16 LUTs and changing a key takes approximately 
0.12 seconds. 
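
The core of this procedure, overwriting individual bits at known positions and then refreshing a checksum, can be sketched as below. Everything in the sketch is hypothetical: the `lut_bit_location` record stands in for an entry of the location file, the bitstream is modelled as a plain byte array, and `toy_checksum` is a simple byte sum used as a placeholder. The actual Virtex frame layout and configuration CRC are different and are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical location-file entry: where one key-schedule bit is stored. */
typedef struct {
    size_t   byte_offset;   /* index into the bitstream image */
    unsigned bit_index;     /* bit within that byte, 0 = least significant */
} lut_bit_location;

/* Placeholder checksum (16-bit sum of bytes), NOT the real configuration CRC. */
uint16_t toy_checksum(const uint8_t *image, size_t length)
{
    uint16_t sum = 0;
    for (size_t i = 0; i < length; i++)
        sum = (uint16_t)(sum + image[i]);
    return sum;
}

/* Overwrite the LUT bits that hold the key schedule and return the new
 * checksum. key_bits[i] is 0 or 1 and is written to locations[i]. */
uint16_t change_keys(uint8_t *image, size_t length,
                     const lut_bit_location *locations,
                     const uint8_t *key_bits, size_t bit_count)
{
    for (size_t i = 0; i < bit_count; i++) {
        uint8_t mask = (uint8_t)(1u << locations[i].bit_index);
        if (key_bits[i])
            image[locations[i].byte_offset] |= mask;
        else
            image[locations[i].byte_offset] &= (uint8_t)~mask;
    }
    return toy_checksum(image, length);
}
```

Because the location table is computed once per design, changing a key reduces to a table-driven pass over the key bits, which is consistent with the sub-second reconfiguration time reported above.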

In some applications, runtime reconfiguration may not be desirable, e.g., if
the bitstream is placed in a ROM or in an Application-Specific Integrated
Circuit (ASIC) implementation. For these cases, shift registers can be employed
for the key-schedule. The shift registers are linked to form one long shift
register when key-schedules are being fetched. This long shift register breaks
down into the original shift registers after initialization. This method
requires minimal logic and routing resources.

6 Results 

Both the bit-parallel and bit-serial IDEA processors were verified with the
Synopsys VHDL Simulator, and were synthesized using Synopsys FPGA Express 3.5
and Xilinx Foundation Series 3.1i, with a Xilinx Virtex XCV300-6 as the target
device.

Our serial and parallel implementations of IDEA were successfully implemented
on the Annapolis Micro Systems Wildcard Reconfigurable Computing Engine [1].
The device is a Type II PCMCIA Card with a 33 MHz 32-bit CardBus interface,
consisting of a Xilinx Virtex XCV300-6 FPGA as Processing Element (PE) and two
64k × 32-bit SDRAMs. A single core parallel implementation was also tested
using a Pilchard card [19], which uses a memory slot interface instead of a
CardBus interface.

6.1 Performance of IDEA Core 

For the bit-parallel implementation, a single core/round of the algorithm requires 
1178 Virtex slices. An XCV300 device can accommodate two rounds of the al- 
gorithm, accounting for 2444 slices (including extra logic required for scaling), 
or 79.56% of the total 3072 slices. 
For the bit-serial implementation, the fully-pipelined implementation (8 rounds
plus output transformation), with parallel-to-serial converters at inputs and
serial-to-parallel converters at outputs, requires 2878 Virtex slices, which
occupy 93.68% of CLB resources.

It was observed that the building blocks offer faster computations in the 
stand-alone configuration, but performance degrades when they are being used 
as components in the hierarchical design. Hence, core performance improvement 
may be obtained by floorplanning, such that inter-component routing is mini- 
mized. The performance of the cores (assuming a high-bandwidth interface to 
the data sources and sinks) is summarized in Table 2. 



Table 2. Summary of performance for the two implementations on a Virtex
XCV300-6 device.

                                         Bit-parallel   Bit-serial
Number of cores                          2              1
Clock rate (MHz)                         82.0           150.0
Encryptions per second ($\times 10^6$)   18.220         9.375
Encryption rate (Mb/sec)                 1166.08        600.0
Latency (µs)                             2.134          6.153



In an attempt to explore tradeoffs between performance and area, the core
was generated for FPGAs of different capacities. Since there are no data
dependencies, the implementations can easily be scaled by instantiating
multiple cores. The designs were maximally scaled within the resource limitation
of each device to produce the results summarized in Table 3.

6.2 Performance 

On the Wildcard implementation, the time taken to complete a transaction
between the FPGA and host is dominated by the setup time of the CardBus
interface. When designing the interface between the IDEA core and the host, it
is crucial that the number of discrete transactions is minimized and the amount
of data transferred per transaction is maximized.

Data from the host is written directly to the core using a burst mode transfer
of 1024 64-bit plaintext blocks. After the latency period, the ciphertext is
written to consecutive locations in the BlockRAM. For XCV300 devices, there are
eight 256 × 32-bit BlockRAMs [28] on the chip and they are all used in the
host/IDEA interface. The results are read by the host from the IDEA processor by
doing a burst mode transfer of the contents of the BlockRAM. The decryption
process is similar except the ciphertext is written to the IDEA core and the
plaintext appears in the BlockRAM.

Table 3. Tradeoffs between performance and area of the IDEA cores on different
devices.

                                         Bit-parallel                Bit-serial
Device (speed grade -6)                  XCV300  XCV600  XCV1000     XCV300  XCV600  XCV1000
Scaling                                  2x      5x      9x          1x      2x      4x
Number of slices                         2444    6368    11602       2878    5756    11512
Device utilization                       79.56%  92.13%  94.42%      93.68%  83.28%  93.68%
Encryptions per second ($\times 10^6$)   18.220  45.551  81.991      9.375   18.750  37.500
Encryption rate (Mb/sec)                 1166.1  2915.3  5247.4      600.0   1200.0  2400.0

The interface between host and IDEA core on Wildcard requires approxi- 
mately an additional 160 slices, resulting in a total of 2606 slices (84.83%) and 
3039 slices (98.93%) utilization of the XCV300 for the bit-parallel and bit-serial 
implementations respectively. 

The burst transfer rate of CardBus is $33 \times 32 = 1056$ Mb/sec. However,
due to large overheads in the CardBus transactions, both the implementations
achieve a measured performance of $0.61 \times 10^6$ encryptions per second
(39 Mb/sec) on a 300 MHz Intel Pentium II laptop computer. The situation could
be improved by using Direct Memory Access (DMA) channels. In addition, utilizing
the two 64k × 32-bit SDRAMs on the Wildcard could provide a larger buffer for
ciphertext storage, hence reducing the number of transactions.

6.3 Pilchard 

In an attempt to improve the PC to FPGA data transfer rate, the bit-parallel
implementation was ported to a Pilchard card [19], which utilizes a memory
slot interface for improved performance over a CardBus interface. The Pilchard
card used the same XCV300-6 device as the Wildcard. The implementation
uses only a single IDEA core/round and requires a total of 1319 slices (42.93%
utilization). Pilchard offers a higher bandwidth between the PC and FPGA and
the implementation achieved a measured encryption performance of 146 Mb/sec
on an Intel Pentium III 866 MHz desktop PC.

7 Conclusion 

Two high-performance runtime reconfigurable implementations of the IDEA
algorithm were presented in this paper. In both designs, the bitstream is
customized for a particular key, and this procedure saved hardware resources in
our design. In implementations on the same XCV300-6 part, the bit-parallel
version achieved an encryption rate of 1166 Mb/sec using an 82 MHz clock,
whereas the bit-serial implementation achieved a 600 Mb/sec throughput at a
clock rate of 150 MHz. The bit-parallel implementation achieved a higher
throughput with lower latency than the bit-serial implementation, while the
bit-serial implementation permits a minimal area fully-pipelined design.
Abstract. This work proposes a new elliptic curve processor architecture for
the computation of point multiplication for curves defined over fields GF(p).
This is a scalable architecture in terms of area and speed, especially suited
for memory-rich hardware platforms such as field programmable gate arrays
(FPGAs). This processor uses a new type of high-radix Montgomery multiplier
that relies on the precomputation of frequently used values and on the use of
multiple processing engines.



1 Introduction 

This work introduces, to the authors’ knowledge, the first documented processor
architecture for the computation of elliptic curve point multiplications for
curves defined over fields GF(p). Hardware implementations have been documented
for the computation of point multiplications for curves defined over
$GF(2^m)$. The most notable implementations include [1,2,3,4,5,6].

The architecture presented here is based on the standalone elliptic curve pro- 
cessor architecture introduced in [6]. This architecture is modular, programmable, 
and suitable for algorithms that rely on precomputations. 

Multiplication is typically the most critical operation in the computation 
of elliptic curves point multiplications. The architecture introduced here uses a 
Montgomery multiplier. This type of multiplier has been the subject of extensive 
research, see for example [7,8,9,10,11]. 

For the elliptic curve processor (ECP) introduced here, this work devel- 
ops a new multiplier architecture that draws from [9,12] an approach for high 
radix multiplication, from [8,9] the ability to delay quotient resolution, and 
from [10] the use of precomputation. In particular, this work extends the con- 
cept of precomputation. The resulting multiplier architecture is a high-radix, 
precomputation-based modular multiplier. 

2 Mathematical Background 

This section provides a brief introduction to elliptic curve point multiplication. 
Additional information can be found in [13,14]. 
The ECP computes elliptic curve point multiplications for arbitrary curves
defined over GF(p). Point multiplication is defined as the product
$kP = \underbrace{P + P + \cdots + P}_{k \text{ times}}$, where k is an integer
and P is a point on the elliptic curve. For fields GF(p), the curves of interest
are defined by $y^2 = x^3 + ax + b$, where $4a^3 + 27b^2 \not\equiv 0 \bmod M$
and $M > 3$.

One can visualize the computation of point multiplications as a hierarchy of 
processing functions. At the top of the hierarchy are the point multiplication 
functions. These functions compute point multiplications with repeated point 
additions and point doubles. At the next level of the hierarchy are the point 
addition and point double functions, which are intimately related to the co- 
ordinates used to represent the points. At the bottom of the hierarchy are the 
finite field functions required to perform the point addition and the point double 
functions. Figure 1 shows how this hierarchy maps into the ECP architecture.

The ECP is best suited for the computation of point multiplications using
projective coordinates. When compared against algorithms for affine coordinates,
algorithms for projective coordinates trade inversions in the point addition and
in the point double operations for a larger number of multiplications and a
single inversion at the end of the algorithm. This inversion can be computed
with repetitive multiplications: $a^{-1} \bmod M = a^{M-2} \bmod M$, for prime
modulus M.
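
Inversion by repeated multiplication can be sketched with a square-and-multiply loop, as below. The function names are hypothetical and the operands are limited to moduli below $2^{32}$ so that intermediate products fit in 64 bits; the processor itself works on far larger operands using its Montgomery multiplier.

```c
#include <assert.h>
#include <stdint.h>

/* a^e mod m by square-and-multiply. Requires m < 2^32 so that the
 * intermediate products fit in 64 bits. */
uint64_t mod_pow(uint64_t a, uint64_t e, uint64_t m)
{
    uint64_t result = 1 % m;
    a %= m;
    while (e != 0) {
        if (e & 1)
            result = (result * a) % m;
        a = (a * a) % m;
        e >>= 1;
    }
    return result;
}

/* Inverse of a modulo a prime m, using a^(m-2) = a^(-1) (mod m).
 * Not meaningful when a is a multiple of m. */
uint64_t mod_inverse_prime(uint64_t a, uint64_t m)
{
    return mod_pow(a, m - 2, m);
}
```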

The ECP uses a Montgomery multiplier. The main advantage of this type of
multiplier is that it facilitates quotient estimation and carry propagation in
hardware adders. Its main disadvantage is that it computes weighted products:
$\mathrm{mult}(A, B) = ABR^{-1} \bmod M$, where R is a constant.

For Montgomery multiplication to be effective, the input operands to the
point multiplication algorithm must be transformed into weighted residues of
the form $AR \bmod M$. The algorithm is then executed using these residues. At
the end of the algorithm, the results are transformed back to non-weighted
residues. Note that, as described in [15], the addition and subtraction of these
residues can be performed using traditional modular addition and subtraction
operations. For most cryptographic algorithms, the cost of these transformations
is amortized over a large number of operations.
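
The flow of transforming operands into weighted residues, multiplying, and transforming back can be illustrated with a basic single-word Montgomery multiplication. This sketch assumes $R = 2^{32}$ and an odd modulus below $2^{31}$; it is the textbook reduction, not the high-radix, precomputation-based multiplier developed in this work, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* -m^(-1) mod 2^32 for odd m, by Newton iteration. */
uint32_t mont_neg_inverse(uint32_t m)
{
    uint32_t inv = m;            /* correct to 3 bits: m * m = 1 (mod 8) */
    for (int i = 0; i < 5; i++)
        inv *= 2u - m * inv;     /* each step doubles the correct bits */
    return 0u - inv;
}

/* mult(A, B) = A * B * R^(-1) mod M with R = 2^32.
 * Requires odd m < 2^31 and a, b < m. */
uint32_t mont_mul(uint32_t a, uint32_t b, uint32_t m, uint32_t m_neg_inv)
{
    uint64_t t = (uint64_t)a * b;
    uint32_t q = (uint32_t)t * m_neg_inv;       /* t + q*m is divisible by R */
    uint64_t u = (t + (uint64_t)q * m) >> 32;
    return (uint32_t)(u >= m ? u - m : u);      /* final conditional subtract */
}

/* Ordinary modular product computed through weighted residues. */
uint32_t mod_mul_via_montgomery(uint32_t a, uint32_t b, uint32_t m)
{
    uint32_t ninv = mont_neg_inverse(m);
    uint32_t ar = (uint32_t)((((uint64_t)(a % m)) << 32) % m);   /* AR mod M */
    uint32_t br = (uint32_t)((((uint64_t)(b % m)) << 32) % m);   /* BR mod M */
    uint32_t abr = mont_mul(ar, br, m, ninv);                    /* ABR mod M */
    return mont_mul(abr, 1, m, ninv);                            /* AB mod M */
}
```

In a point multiplication the two transformations at the start and the final multiplication by 1 are performed once, while `mont_mul` is called many times, which is the amortization referred to above.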



3 Processor Architecture 

The elliptic curve processor (ECP), shown in Figure 1, consists of three main 
components. These components are the main controller (MC), the arithmetic 
unit controller (AUC), and the arithmetic unit (AU). The MC is the ECP’s 
main controller. It orchestrates the computation of kP and interacts with the 
host system. The AUC controls the AU. It orchestrates the computation of 
point additions/subtractions, point doubles, and coordinate conversions. It also 
guides the AU in the computation of field inversions. The AU is the hardware 
that computes field additions/subtractions and multiplications, and performs 
comparisons. 
Fig. 1. Elliptic curve processor architecture

The following is a typical sequence of steps for the computation of kP in 
the ECP using the double-and-add algorithm and the projective coordinates 
algorithms shown in the appendix. 

First, the host loads k into the MC, loads the coordinates of P into the AU,
and commands the MC to start processing. The MC does its initialization and
then commands the AUC to do its initialization. The AUC initialization includes
the conversion of P from affine to projective coordinates and the conversion of
these coordinates into weighted residues ($X = xR \bmod M$, $Y = yR \bmod M$,
$Z = R \bmod M$). During the computation of kP, the MC scans one bit of k at a
time, starting with the second most significant coefficient and ending with the
least significant one. In each of these iterations, the MC commands the AU/AUC
to do a point double. If the scanned bit is a 1, it also commands the AU/AUC to
do a point addition. For each of these point operations, the AUC generates the
control sequence that guides the AU through the computation of the required
field and comparison operations. After the least significant bit of k is
processed, the MC commands the AU/AUC to convert the result back to affine
coordinates. The AU/AUC first converts the result to affine coordinates and then
converts the coordinates to non-weighted residues (x, y). Then, the MC signals
to the host the completion of the kP operation. Finally, the host reads the
coordinates of kP from the AU.
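
The bit-scanning loop run by the MC can be sketched as follows. To keep the example self-contained, the point operations are replaced by stand-ins on integers modulo n (so that "kP" is simply k·p mod n); this substitution is an assumption made purely for illustration, and the function names are hypothetical. Only the control flow mirrors the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in group operations on integers modulo n (NOT elliptic curve
 * arithmetic); n must be below 2^63 to avoid overflow. */
static uint64_t toy_double(uint64_t q, uint64_t n) { return (2 * q) % n; }
static uint64_t toy_add(uint64_t q, uint64_t p, uint64_t n) { return (q + p) % n; }

/* Left-to-right double-and-add: computes kP for k >= 1. */
uint64_t double_and_add(uint64_t k, uint64_t p, uint64_t n)
{
    int top = 63;
    while (top > 0 && ((k >> top) & 1) == 0)
        top--;                           /* locate the most significant 1 */

    uint64_t q = p % n;                  /* the leading 1 is absorbed by Q = P */
    for (int i = top - 1; i >= 0; i--) { /* start at the second MSB */
        q = toy_double(q, n);            /* every iteration: point double */
        if ((k >> i) & 1)
            q = toy_add(q, p % n, n);    /* only for 1 bits: point addition */
    }
    return q;
}
```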

The ECP uses two loosely coupled controllers, the MC and the AUC, that
execute their respective operations concurrently. These are programmable
processors that execute one instruction per clock cycle.

The AU incorporates a multiplier, an adder (or adders), and a register file, all
of which can operate in parallel on different data. The AU’s large register set
supports algorithms that rely on precomputations. An example of a
precomputation-based algorithm is an adaptation of a fixed base exponentiation
method introduced in [16] for operations involving a known point. This algorithm
requires on
average \ rn/w\ -1-2“' point additions, the storage of \m/w'\ points, and no point 
doubles. In the previous expressions, w is the window size, which is a measure of 
the number of bits processed in parallel. To illustrate the benefits of precompu- 
tation, consider a fixed point multiplication for an arbitrary curve defined over 
GF($2^{192} - 2^{64} - 1$), which is one of the fields recommended in [17]. Compared to
the traditional double-and-add algorithm, the fixed point algorithm is over four 
times faster (assuming the use of the projective coordinates in [18] with Z = 1 
and w = Y). 
4 Arithmetic Unit

The Arithmetic Unit (AU) is the ECP’s main processing unit. As Figure 2 shows, 
it consists of a register file, an adder (or adders), and a multiplier. The multiplier 
is the AU’s most critical component, and, consequently, it is the component that 
drives the AU’s architecture. The AU’s architecture is defined at a high level by 
the multiplication algorithm it implements and at a low level by the number 
representation it uses. 




Fig. 2. Arithmetic Unit 



The most popular cryptographic algorithms in use today require arithmetic 
with large operands (160 . . . 1024+ bits). To achieve a high rate of computation, 
most hardware implementations resort to iterated multiplication methods that 
approximate the desired result rather than computing exact ones. The approx- 
imated results are then refined to exact results in post-processing operations. 
The tradeoff is accuracy for speed. The ECP’s multiplier is an example. It im- 
plements an iterated multiplication algorithm that approximates the multipli- 
cation of AR mod M and BR mod M as ABR mod M + eM, where eM is a 
measure of the accuracy of the multiplication. Note that for the basic forms of 
Montgomery multiplication e = 1. 

Number representation is an important element of an arithmetic architecture. 
It defines how the numbers are represented and consequently how arithmetic is 
conducted. The selection of a number representation is influenced by the design 
methodology, the target architecture, and the area-time (or cost-speed) goals. 
The ECP architecture is independent of number representations. To validate the
ECP architecture, a prototype that uses redundant number representation was
developed. The implementation results are discussed in Section 8.

5 Modular Multiplication Algorithm 

Algorithm 1 shows the ECP’s main multiplication algorithm. This algorithm is 
a generalized version of the Montgomery multiplication algorithm with quotient 
pipelining introduced in [9]. This generalized version supports positive and neg- 
ative operands and incorporates Booth recoding and precomputation. Positive 
and negative numbers arise naturally in Booth recoding and they are often used 
in elliptic curve algorithms. 

Booth recoding is a technique that allows the representation of a two's
complement number $B = \sum_{i=0}^{s-1} B_i 2^{ui}$, where $s = n/u$,
$B_i \in [0, 2^u)$ and $B_{s-1} \in [-2^{u-1}, 2^{u-1})$, as
$B = \sum_{i=0}^{s-1} B'_i 2^{ui}$, where $B'_i \in [-2^{u-1}, 2^{u-1}]$. Here we
assume that B is represented by an integer number of digits of radix $2^u$ and
also that its most significant bit represents the sign.

This work uses the Modified Booth Algorithm, which is a window-based
method [19,20]. This method uses s windows, where window $i$ groups the
set of bits $\{b_{iu+(u-1)} b_{iu+(u-2)} \cdots b_{iu-1}\}_2$ for $i = 0..s-1$,
and where $b_{-1} = 0$ ($B_i = \sum_{j=0}^{u-1} b_{iu+j} 2^j$). The set of bits
enclosed by window $i$ is encoded as
$B'_i = -b_{iu+(u-1)} 2^{u-1} + (\sum_{j=0}^{u-2} b_{iu+j} 2^j) + b_{iu-1}$.
Note that in Algorithm 1 the recoding is done on a digit-by-digit basis. For
this algorithm r, s, u, and v divide k. The variables $qh_i$ and $bh_i$ are
respectively the most significant bits of $Q_i$ and $B_i$.
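
The window encoding described above can be sketched in a few lines of Python.
This is only an illustration of the recoding rule, assuming the operand is
given as a signed integer that fits in $s \cdot u$ bits; the function name
booth_recode is hypothetical and does not come from the paper.

```python
def booth_recode(B, u, s):
    """Recode a two's complement integer B (s digits of u bits each)
    into s signed digits B'_i in [-2^(u-1), 2^(u-1)]."""
    bits = B & ((1 << (u * s)) - 1)      # two's complement bit pattern of B
    digits = []
    prev_top = 0                         # b_{-1} = 0
    for i in range(s):
        window = (bits >> (u * i)) & ((1 << u) - 1)
        top = window >> (u - 1)          # b_{iu+(u-1)}
        low = window & ((1 << (u - 1)) - 1)
        # B'_i = -b_{iu+(u-1)} * 2^(u-1) + low bits + b_{iu-1}
        digits.append(-top * (1 << (u - 1)) + low + prev_top)
        prev_top = top
    return digits


if __name__ == "__main__":
    d = booth_recode(127, u=2, s=4)
    print(d, sum(x << (2 * i) for i, x in enumerate(d)))
```

Each digit needs only multiples of the operand up to $2^{u-1}$ in magnitude,
which is why the recoding halves the number of precomputed values.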

The validity of Algorithm 1 can be proven using an induction argument similar
to the one used in [9] to prove the validity of the Montgomery multiplication
algorithm on which this algorithm is based. One can verify with induction on i
that Equation (1) defines an invariant of the loop. For this verification note
that $\lfloor S_i/2^k \rfloor$ defines a truncated division equivalent to
$(S_i - Q_i)/2^k$.

Notation: The symbol $|x|_{\widetilde{M}}$ is used to express an approximate
modulo reduction that satisfies the following relation:
$|x|_{\widetilde{M}} = x \bmod M + eM \equiv x \bmod M$. $|x|_M$ is used to
express least residue; that is, $|\,|x|_M\,| < M$, where the symbol $|y|$
represents the absolute value of y.

Using the loop invariant in Equation (1), one can verify that when
$i = n+d+1$ the output of the algorithm satisfies Equation (2). This equation
establishes that the multiplication output is
$S_{n+d+2} = |AB/R|_{\widetilde{M}}$ (note that $QM \equiv 0 \bmod M$).
This equation also defines the range, or accuracy, of the multiplication result
in terms of the maximum values of A and B (note $B_{i>n} = 0$); the maximum
value for the reduction terms, $\overline{QM}$, which is defined in Equations
(4-5) (note $Q_0 = 0$); the value of the multiplication constant $R = 2^{kn}$;
and the quotient resolution
delay, d.

Algorithm 1: Modular Multiplication with Precomputation

Inputs:

  $A \in (-\bar{A}, \bar{A})$, $M > 0$

  $B = \sum_{i=0}^{n-1} B_i 2^{ki} \in (-\bar{B}, \bar{B})$, $\bar{B} > 0$,
  $B_i \in [0, 2^k)$, $B_{n-1} \in [-2^{k-1}, 2^{k-1})$,
  $B_{i>t} = 0$, $t < n$

  $\alpha = |2^{-k(d+1)}|_{\widetilde{M}}$, $\gcd(M, 2) = 1$, $R = 2^{kn}$,
  d = quotient resolution delay

Output:

  $S_{n+d+2} = |AB/R|_{\widetilde{M}} \in (-(\bar{A}\bar{B} + \overline{QM})/R,\ (\bar{A}\bar{B} + \overline{QM})/R)$

/* Pre-processing */

1. $S_0 = 0$, $Q_{i<0} = 0$

2. for $i = 0$ to $2^{r-1}$ do
   2.1. $A[i] = iA$
   end for

3. for $i = 0$ to $2^{u-1}$ do
   3.1. for $j = 0$ to $v - 1$ do
        3.1.1. $\alpha[i,j] = |i \alpha 2^{uj}|_{\widetilde{M}}$
        end for
   end for

/* Processing */

4. for $i = 0$ to $n + d$ do

   /* Quotient Determination */
   4.1. $Q_i = |S_i|_{2^k}$

   /* Recoding: $ql_j \in [-2^{u-1}, 2^{u-1}]$; $bl_j \in [-2^{r-1}, 2^{r-1}]$; $qh_j, bh_j \in [0,1]$ */
   4.2. $qh_i 2^k + \sum_{j=0}^{v-1} ql_{iv+j} 2^{uj} = Q_i + qh_{i-1}$   /* k = uv */
   4.3. if $i < n$ then
           $bh_i 2^k + \sum_{j=0}^{s-1} bl_{is+j} 2^{rj} = B_i + bh_{i-1}$   /* k = rs */
        else
           $\sum_{j=0}^{s-1} bl_{is+j} 2^{rj} = 0$   /* $B_{i>n} = 0$ */
        end if

   /* Reduction */
   4.4. $Q\alpha_i = \sum_{j=0}^{v-1} \alpha[\,|ql_{iv+j}|, j\,](\mathrm{sign}(ql_{iv+j}))$
   4.5. $AB_i = \sum_{j=0}^{s-1} A[\,|bl_{is+j}|\,](\mathrm{sign}(bl_{is+j})) 2^{rj}$
   4.6. $S_{i+1} = \lfloor S_i/2^k \rfloor + Q\alpha_{i-d} + AB_i$

   end for

/* Post-processing */

5. $S_{n+d+2} = 2^{kd} S_{n+d+1} + \sum_{i=0}^{d-1} Q_{n+1+i} 2^{ki} + qh_n$

Loop Invariant:

$2^{ki} S_i + 2^{k(i-d)} \sum_{l=0}^{d-1} Q_{i+l-d} 2^{kl} + qh_{i-d-1} 2^{k(i-d)} = 2^k A \left( \sum_{j=0}^{i-1} B_j 2^{kj} - bh_{i-1} 2^{ki} \right) + 2^k \sum_{l=0}^{i-d-2} (Q\alpha_{l+1} 2^{k(d+1)} - Q_{l+1}) 2^{kl}$   (1)

Result after n+d+1 loop iterations:

$2^{kd} S_{n+d+1} + \sum_{j=0}^{d-1} Q_{n+1+j} 2^{kj} + qh_n = \dfrac{AB + \sum_{l=0}^{n-1} (Q\alpha_{l+1} 2^{k(d+1)} - Q_{l+1}) 2^{kl}}{R}$   (2)

$= (AB + QM)/R$

$\in (-(\bar{A}\bar{B} + \overline{QM})/R,\ (\bar{A}\bar{B} + \overline{QM})/R)$



QM: 



V—1 

= (3) 

i-0 
n— 1 

QM = (4) 

3=0 

$\in (-\overline{QM}, \overline{QM})$

$\overline{QM} > \max(|QM|)$   (5)

Note that implementations can take advantage of the parallelism defined in
steps 4.2-4.6 of Algorithm 1 without using Booth recoding. These
implementations can set $qh_i = bh_i = 0$ for all i, and use digits
$ql_{iv+j} \in [0, 2^u)$ and $bl_{is+j} \in [0, 2^r)$.

6 Analysis of Modular Multiplication Algorithm 

In order to realize an area efficient multiplier, the ECP implements Algorithm 1 
using precomputation. Precomputation reduces the complexity of the multiple 
input adder needed to add all the terms in step 4.6 of Algorithm 1 at the expense 
of a set of additions at the beginning of the algorithm (steps 2 and 3) and 
storage. The issues associated with the implementation of Algorithm 1 using 
precomputations are studied in the next sections. 



6.1 Accuracy 

The accuracy of the modular multiplication result is influenced by the range 
of the input operands, the method employed to compute reduction terms, the 
multiplication constant R, and the quotient resolution delay d. 

In [12] two methods are defined for the computation of the reduction terms
$|x\alpha|_{\widetilde{M}}$. These methods are referred to as
multiplication-based and lookup-based reduction methods. The
multiplication-based approach computes
$|x\alpha|_{\widetilde{M}} = x\,|\alpha|_{\widetilde{M}}$. The lookup-based
method obtains $|x\alpha|_{\widetilde{M}}$ from a table of precomputed
residues. The accuracy of one multiplication-based and two lookup-based
reduction methods is summarized in Table 1. (Note that the reduction method
affects the value of $Q\alpha_i$.)

R is a design parameter that influences the reduction accuracy of the
multiplier. As Table 1 shows, the accuracy of the result is bounded by the
magnitude of QM, which grows proportionally with R. For applications requiring
iterated multiplications, such as modular exponentiations, R is often chosen so
that the accuracy of a multiplication result falls in the range
$(-2\overline{QM}/R, 2\overline{QM}/R)$, or $[0, 2\overline{QM}/R)$ when
handling only positive numbers. Examples for this last application
can be found in [9], for which A,B€ [0, R >

and QM G [0, 2'=(^+i)Mi?). 

The results in Table 1 correspond to the worst-case values for QM, where
the value of a modulo reduction is approximated as M. Parameter selection
can greatly improve the accuracy and speed of a multiplication; for example, 
[17] specifies modulus of the form M = TOi2* +/_ 1, for which = 

(M “/+ l)/2^^ for t > kx. For t > k{d + 1), QM < MR for all the reduction 
methods listed in Table 1. 



Table 1. Accuracy of multiplication- and lookup-based reduction methods 



Red. method 


QcV-i 


QM (worst case) 


Multiplication 




2^j 

M 


2k(d+i)j^ ( 


' 2^^-l \ 
, 2--1 ) 


< 2'^A+i)mR 


Lookup 1 


2^j=o 


qk,+j2->^A+i) 


M 


2fc(d+i)^ 


( 2'“"-! \ 
2^-1 ) 


< 2''(‘*+bMi?(^) 


Lookup 2 


— 1 

2^j=o 




2>‘A+i)m 


j u < 2*‘'^MR{2v) 



6.2 Processing Time 

Equation (6) provides a processing time approximation for Algorithm 1. This
equation assumes that a single precomputation engine performs all the
precomputations and transmits them to the respective processing units. In this
equation $T_b$ and $T_q$ represent the processing time for the computation of
$AB_i$ and $Q\alpha_i$ for $i = 0..n+d$. The processing time is the sum of the
precomputation time, which is identified with the p subscript, and the
processing time, which is identified with the m subscript. (Note that the
processing time $T_{bm}$ includes n processing operations because
$B_{i>n} = 0$. This condition does not apply to $T_{qm}$.)

The expression in Equation (6) is normalized with respect to a reference unit 
of time. The processing cost of a precomputation operation is weighted by a 
factor a and the processing cost of a processing operation is weighted by the 
factor b. The factor c defines the number of multiplications over which the pre- 
computation cost is amortized. (Note that it is common in many cryptographic 
algorithms to perform a large number of consecutive operations using the same 
modulus.) 

The factor e represents the number of precomputation sets to be computed.
As written, Algorithm 1 requires one set for the scalar products $AB_i$ and up
to v sets for the scalar products $Q\alpha_i$. Note that for the Lookup 2
reduction method v sets need to be computed in step 3.1. For the multiplication
and the Lookup 1 reduction methods, the precomputation engine can broadcast a
single precomputation set to the relevant processing engines. For the
precomputation of a single set, eliminate the loop in step 3.1, compute
$\alpha[i] = |i\alpha|_{\widetilde{M}}$ in step 3.1.1, and compute
$Q\alpha_i = \sum_{j=0}^{v-1} \alpha[\,|ql_{iv+j}|\,](\mathrm{sign}(ql_{iv+j})) 2^{uj}$
in step 4.4.

$T_{mm} = T_b + T_q = (T_{bp} + T_{bm}) + (T_{qp} + T_{qm})$   (6)

= + m) + + b,{n + d)) 

To determine the optimum number of precomputations it is best to express
Equation (6) in terms of $m = \lceil \log_2 M \rceil$. Equation (7) provides an
approximation, where $n = \lceil m/k \rceil + d + f$, f is a constant and
$R = 2^{kn}$. According to Equation (2) and the possible cases in Table 1,
$f \in [0,2]$ when $\bar{A} = \bar{B} = 2\overline{QM}/R$ and the target
multiplication accuracy is
$S_{n+d+2} \in (-2\overline{QM}/R, 2\overline{QM}/R)$.
These parameters are of interest here because they define a small number of
iterations for Algorithm 1 that generate results suitable for repeated
multiplications and they also allow a number of additions to be performed
between multiplications without the need for reduction. Unless otherwise
specified, this document will assume the use of the aforementioned parameters
for general multiplications.



Tmm= (^“^2’'-i + 64m/rsl +66(d+/)) (7) 

6.3 Operations of Interest for Scalar Point Multiplication 

Table 2 lists some of the operations of interest in the computation of point
multiplications. This table assumes that $\bar{A} = \bar{B} = 2\overline{QM}/R$.
Entries 1 and 2 in this table are used in the projective coordinate algorithms
defined in [18]. Entry 1 corresponds to the classical multiplication operation.
Entry 2 defines a division by 2 requiring just $d + 2$ iterations of the loop
in Algorithm 1. Entry 3 defines a multiplication of a special form which is
used here to reduce the magnitude of a value presumed to be
$|0|_{\widetilde{M}}$ before comparing it to zero. Note that for Entry 3,
$\overline{QM}$ is defined with respect to x ($n = x$) as shown in Table 1, and
this value may be different from the value of $\overline{QM}$ used to define
$\bar{A}$.

Some of the elliptic curve algorithms defined in the open literature, such as
the ones in [18], use comparisons in time-critical functions, such as point
addition and point double. Comparisons are used, among other things, to
identify the point at infinity during point add and point double operations.
These comparisons involve field elements; therefore numbers A and B are
considered equal if $A - B = |0|_{\widetilde{M}}$, which implies that their
difference is a multiple of M.

The accuracy of Algorithm 1 is of the order $\overline{QM}/R$, where
$\overline{QM}$ is defined in Table 1. Rather than adding specialized circuitry
to perform this function, here we recommend an approach that multiplies a value
presumed to be zero by a constant. The idea is to perform this multiplication
with high accuracy in a short amount of time. To achieve high accuracy, we
recommend the use of Algorithm 1 with low quotient resolution delay
($d \approx 0$) and possibly a more exact version of Algorithm 1 (see Table 1).
To achieve a short processing time, we recommend multiplication by $2^{-kx}$
according to Table 2, where the parameter x is adjusted so that the value of
the multiplication result is close to the value of M.

The recommended algorithm for the comparison of two field elements A and
B works as follows. First compute $A - B$. The result of this operation is a
multiple of M if $A = B$ and, if that is the case,
$|(A - B)/2^{kx}|_{\widetilde{M}}$ will also be a multiple of M. Then, compute
$|(A - B)/2^{kx}|_{\widetilde{M}}$ according to Table 2. Finally, refine the
result to a value in the range $(-M, M)$ and compare it against 0.

For the comparison of A with zero, assume that A = which 

could correspond to the Multiplication-based reduction method shown in Table 
1, and that the operation | A/2''^ is done using the Lookup 2 reduction method 
with d = 0 and x = da + 2, where da corresponds to the quotient resolution delay 
associated with A. For this example, the result of | falls in the range 

$(-(2v+1)M, (2v+1)M)$. This result can be computed with $d_a + 3$ iterations of
the loop in Algorithm 1, but because these iterations are computed without
quotient resolution delay ($d = 0$) each one can take up to $d_a$ clock cycles.
In other words,
a multiplier can compute multiplications with and without quotient resolution 
delay, but when performing operations involving no quotient resolution delay, 
the multiplier must wait for the quotient resolution. The quotient resolution is 
assumed to take up to da clock cycles. 

Note that the algorithm just described is useful for a large set of
applications. If additional accuracy is needed for the reduction operations,
one can implement in the ECP a more accurate version of the multiplication
algorithm; one such algorithm is presented in [9].



Table 2. Multiplications of interest 



# 


Mult. 


B 


AB 


R 


n 


'S'n + d+2 


1 


\ABR-^ 


\m 


B 


< 2’^iQMIRf 




l0Q2k R 


< 2QMIR 


T 


I^/2|m 




2'^-^A 




1 


< A 


T 


A/2'=" 




1 


A 


(^kx 


X 


< iA+QM)/2'^^ 



6.4 Area and Storage 

The most complex operation of Algorithm 1 is the computation of the two scalar
multiplications $AB_i$ and $Q\alpha_i$; the multiplication in step 5 is just a
shift operation. These scalar multiplications can be computed using scalar
multipliers. For the computation of a scalar multiplication, a scalar
multiplier would add up to $k/2$ numbers per clock cycle when employing Booth
recoding and k numbers when using no recoding. Assuming that all the operands
in Algorithm 1 are of the same size, the concurrent computation of step 4.6
would require the addition of $k + 1$ operands when using recoding or $2k + 1$
when using no recoding. On the other hand, when using precomputation the
concurrent computation of step 4.6 requires the addition of $s + v + 1$
operands.

A limiting factor in the practical implementation of multiplication with
precomputation is the size of the memory required to store the precomputed
values. The use of Booth recoding in Algorithm 1 reduces the memory
requirements by half when no storage is provided for values known to have zero
value (e.g., $0 \cdot A$). Assuming that each precomputed product used in the
computation of $AB_i$ requires $m + k(d+1) + r$ bits of storage, that each
precomputed scalar product used in the computation of $Q\alpha_i$ requires
$m + u$ bits of storage, and that each processing unit stores its own set of
precomputed values, Algorithm 1 requires
$s 2^{r-1}(m + k(d+1) + r) + v 2^{u-1}(m + u)$ bits of storage. Note that if
multiple reduction methods are used concurrently, such as the use of a
reduction method with $d \ne 0$ and one with $d = 0$, more than one copy of
reduction coefficients needs to be stored.
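
As a worked instance of the storage expression above, the following Python
sketch evaluates it for the parameter values quoted later for the prototype
multiplier (s = v = 2, r = u = 4, k = 8, m = 192, d = 4). The function name is
hypothetical, and the formula is used as read from the text, with s sets of
$2^{r-1}$ products for $AB_i$ and v sets of $2^{u-1}$ products for the
reduction terms.

```python
def precomp_storage_bits(m, k, d, r, s, u, v):
    """Bits needed for the precomputed tables, per the storage expression:
    s * 2^(r-1) * (m + k(d+1) + r)  +  v * 2^(u-1) * (m + u)."""
    ab_bits = s * 2 ** (r - 1) * (m + k * (d + 1) + r)   # tables feeding AB_i
    qa_bits = v * 2 ** (u - 1) * (m + u)                 # tables feeding the reduction terms
    return ab_bits + qa_bits


if __name__ == "__main__":
    bits = precomp_storage_bits(m=192, k=8, d=4, r=4, s=2, u=4, v=2)
    print(bits, "bits")
```

Varying r against s (or u against v) for a fixed k shows how the table size
trades against the number of processing elements.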

Note that the relationships between r and s, and, between u and v, allow 
designers to control the memory size; for example, to achieve a given k, a designer 
could fix r and then derive s, which defines the required number of processing 
elements. This approach is particularly attractive for architectures that employ 
fixed size memory elements, such as field programmable gate arrays (FPGAs). 



6.5 Effect of Quotient Pipelining 

Quotient pipelining is the technique that enables a high rate of computation
by allowing the use of delayed reduction terms ($d \ne 0$). The delay is
reflected in steps 4.4 and 4.6 of Algorithm 1. The computation in step 4.4
occurs in the background and takes d iterations to complete. To avoid stalling,
the results from step 4.4 are consumed as they become available in step 4.6.

The cost of this technique is reduced accuracy, increased processing time and 
increased area. The impact of this technique can be reduced by eliminating pro- 
cessing functions associated with the quotients, such as recoding, and by hiding 
quotient operations behind other functions. For example, the scalar products in 
step 4.6 of Algorithm 1 could be computed serially with all the processing en- 
gines dedicated to the computation of a scalar multiplication, instead of having 
two sets, each working on a different scalar product. 
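
To make the role of the delay d concrete, here is a minimal Python sketch of
Montgomery multiplication with quotient pipelining in the style of [9]. It is a
simplification, not Algorithm 1 itself: it assumes non-negative operands, uses
no Booth recoding and no precomputed tables, and the function and variable
names are hypothetical.

```python
def mont_mul_quotient_pipelined(A, B, M, k, d, n):
    """Return a value congruent to A*B*2^(-k*n) mod M (M odd, 0 <= B < 2^(k*n)).
    The quotient digit produced in iteration i is consumed d iterations later."""
    w = k * (d + 1)
    m_prime = (-pow(M, -1, 1 << w)) % (1 << w)   # -M^(-1) mod 2^w
    m_tilde = m_prime * M                        # m_tilde = -1 (mod 2^w)
    alpha = (m_tilde + 1) >> w                   # exact; alpha = 2^(-w) (mod M)
    mask = (1 << k) - 1
    q = {}                                       # quotient digits Q_i
    S = 0
    for i in range(n + d + 1):
        q[i] = S & mask                          # Q_i = S_i mod 2^k
        b_i = (B >> (k * i)) & mask if i < n else 0
        q_delayed = q.get(i - d, 0)              # Q_{i-d}, zero while i < d
        S = (S >> k) + q_delayed * alpha + b_i * A
    # Post-processing: fold in the quotient digits not yet consumed.
    return (S << (k * d)) + sum(q[n + 1 + j] << (k * j) for j in range(d))


if __name__ == "__main__":
    M = (1 << 61) - 1
    out = mont_mul_quotient_pipelined(12345, 67890, M, k=8, d=2, n=9)
    print(out % M == (12345 * 67890 * pow(1 << 72, -1, M)) % M)
```

The output is only congruent to the desired product, not fully reduced, which
mirrors the accuracy cost that the delayed reduction terms introduce.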



6.6 Number Representation 

The previous discussion considered the upper layers of the ECP architecture, 
which are independent of the number representation. This section considers the 
specific example of stored-carry representation. 

Stored-carry representation is attractive for the implementation of an ECP
because of, among other things, its support for fast addition using carry-save
addition, its natural interaction with non-redundant number representation, and
its ability to support two's complement arithmetic.

The main drawbacks of stored-carry representation stem from its representation
of a number with two numbers; for example, $A = C + S$, where A, C, and S are
numbers of almost equal size. This representation doubles the storage
requirements for an operand with respect to non-redundant representation and
makes comparisons difficult. A comparison can be carried out by performing
a subtraction, converting the result to non-redundant representation and then
comparing the result against zero.
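
The following Python sketch illustrates the idea behind stored-carry
representation for non-negative integers: a value is held as a pair (C, S), a
new operand is absorbed with bitwise logic only, and a full carry-propagating
addition is needed only when the non-redundant value is required. The function
name is hypothetical and the sketch ignores word width and sign handling.

```python
def carry_save_add(c, s, x):
    """Add x to the stored-carry pair (c, s) without propagating carries.
    Invariant: new_c + new_s == c + s + x."""
    new_s = c ^ s ^ x                                # bitwise sum
    new_c = ((c & s) | (c & x) | (s & x)) << 1       # bitwise carries, shifted
    return new_c, new_s


if __name__ == "__main__":
    c, s = 0, 0
    for operand in (13, 200, 7, 1023):
        c, s = carry_save_add(c, s, operand)
    # Conversion to non-redundant form requires one carry-propagate addition.
    print(c + s)
```

Because many different (C, S) pairs encode the same value, comparing two
numbers requires this final conversion, which is the drawback noted above.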

The use of Booth recoding in Algorithm 1 alleviates the storage requirements 
imposed by stored-carry representation. In addition, the ability to amortize pre- 
computations over a large number of operations can be used to reduce memory 
requirements by storing precomputed values in non-redundant representation. 
The ECP's multiplier architecture, shown in Figure 2, also makes provisions
for the conversion of numbers to non-redundant representation; for example, the
conversion of B can be done on a digit-by-digit basis before recoding. In
addition, the system could employ a carry propagate adder for the conversion of
numbers to non-redundant representation before storing them in the register
file.

7 Multiplier Architecture 

The AU's architecture is shown in Figure 2. The multiplier and adder together
implement Algorithm 1. The adder, which is optimized for accumulation
($A = A + B$), feeds precomputation values to the multiplier. Both the adder
and the multiplier receive one of their inputs from the register file. They
also output results to the register file.

To accomplish a high rate of computation, the architecture shown in Figure 
2 can be implemented using stored-carry representation. To balance storage and 
processing speed requirements one can choose to represent some numbers in 
stored-carry representation and others in non-redundant representation. 

The reduction terms $|i\alpha 2^{uj}|_{\widetilde{M}}$ and some of the
temporary results can be converted to non-redundant representation before
storage. Operand B of the multiplication can be loaded to the multiplier in
stored-carry form and then converted to non-redundant representation one digit
at a time as the loop in Algorithm 1 progresses. The reduction terms $Q_i$ can
also be converted to non-redundant representation before applying Booth
recoding.

To support stored-carry representation, the architecture in Figure 2 must
be enhanced with a carry propagate adder and with an efficient way to store
numbers represented in stored-carry representation. For the ECP prototype
described in the next section, we implemented a carry propagate adder with a
digit-serial adder placed at point (A) in Figure 2. For the storage of numbers
represented in stored-carry representation, we recommend that the output
multiplexer in Figure 2 be able to independently forward to the register file
each of the numbers used to represent a number in stored-carry representation;
that is, for $A = C + S$, this multiplexer can send either C or S to the
register file.

Note that the two numbers used to represent a number in stored-carry
representation can be treated as two numbers represented in non-redundant
representation. Therefore, for the terms represented in stored-carry
representation, such as $A[\,|bl_{is+j}|\,](\mathrm{sign}(bl_{is+j}))$, one can
use two processing units per term, where each processing unit handles numbers
represented in non-redundant representation. This design approach allows the
use of a common processing unit architecture for stored-carry and non-redundant
number representations.



8 Prototype Implementation 

The validity of the ECP architecture was verified with a prototype that
implemented the double-and-add algorithm using the projective coordinates
algorithms defined in [18] for point addition and point double operations (the
algorithms are shown in the appendix). This prototype was programmed to support
the field $GF(2^{192} - 2^{64} - 1)$, which is one of the fields specified in
[17].

To verify the ECP's architectural scalability to larger fields, a modular
multiplier for fields as large as $GF(2^{521} - 1)$ was also prototyped. This
field is the biggest one recommended in [17] for elliptic curves defined over
GF(p). This prototype exhibits the same area scalability and frequency of
operation as does the multiplier of the ECP prototype. The following discussion
focuses exclusively on the ECP prototype.

The ECP prototype used a 16-bit MC processor with 256 words of program
memory, a 32-bit AUC processor with 2048 words of program memory, and a
dual set of 128 registers, each of which is $m + k(d + 2)$ bits wide. The dual
set of registers permits the storage of numbers in stored-carry representation.
(Note that a single register set capable of storing stored-carry numbers could
have been used instead.) The prototype provided a 32-bit I/O interface to the
host system. The ECP multiplier exhibits the following attributes: s = v = 2,
r = u = 4, k = 8, m = 192, and d = 4.

The ECP prototype for $GF(2^{192} - 2^{64} - 1)$ uses 11,416 LUTs, 5,735
Flip-Flops, and 35 BlockRAMs. LUTs are lookup tables that are used in the
prototype as 16x1-bit RAMs or as 4-input gates. The BlockRAMs are dual-ported
4k-bit blocks of RAM, which are used in the register file, in the MC and AUC as
program memory, and in the multiplier as Booth recoders. The frequency of
operation of the prototype was 40 MHz. (The frequency of operation of the
521-bit multiplier was 37.3 MHz.)

The validity of the prototype was verified with non-optimized code. Assuming
that the ECP is coded in a form that extracts 100% throughput from its
multiplier, it will compute a point multiplication for an arbitrary point on a
curve defined over $GF(2^{192} - 2^{64} - 1)$ in approximately 3 msecs
($n = 192/8 + 1$, $d = 4$, $k = rs = uv = 8$) using the algorithms shown in the
appendix. This estimate ignores the processing cost of additions and overhead
operations and it assumes the computation of 17m multiplications per point
multiplication: 15.5m for the point double and the point add operations and
1.5m for the inverse required in the conversion to affine coordinates. For the
modular multiplications, this estimate assumes negligible precomputation cost
for the reduction terms, $Q\alpha_i$, and assumes the precomputation of 2^ - 2
values for the terms $AB_i$ (no computation required for 0A or 1A).
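
The order of magnitude of this estimate can be checked with a short
calculation. The Python sketch below assumes one multiplication every
$\lceil m/k \rceil + 2d$ clock cycles, the throughput figure listed for the ECP
in Table 3, and 17m multiplications per point multiplication. The function name
is hypothetical, and the exact cycle count per multiplication used by the
authors may differ slightly from this assumption.

```python
import math


def point_mult_time_ms(m, k, d, f_mhz, mults_per_bit=17):
    """Rough point-multiplication time, ignoring additions and overhead."""
    clks_per_mult = math.ceil(m / k) + 2 * d     # assumed from Table 3 throughput
    total_mults = mults_per_bit * m              # 17m multiplications
    return total_mults * clks_per_mult / (f_mhz * 1e3)


if __name__ == "__main__":
    # Prototype parameters: m = 192, k = 8, d = 4, 40 MHz.
    print(round(point_mult_time_ms(192, 8, 4, 40.0), 2), "ms")
```

With the prototype parameters this gives roughly 2.6 ms, in line with the
"approximately 3 msecs" quoted above.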

The prototypes were implemented using Xilinx's XCV1000E-8-BG680
(Virtex E) FPGA. The prototypes were coded in VHDL. They were synthesized
with Synopsys' FPGA Compiler 3.5.0 and Xilinx's Design Manager M3.1i.



8.1 Comparisons with Other Implementations 

Table 3 summarizes the features of the multiplier used in the ECP prototype
and the features of one of the multiplier architectures introduced in [10],
which also relies on precomputation. Both of these multipliers exhibit
comparable area requirements (#LUTs) when one assumes s = v = 1 and r = u = 4.
Note that the multiplier in [10] uses a fixed value of k, where this value is
highly dependent on the underlying FPGA architecture.

It should be pointed out that the multiplier architecture introduced in [10]
can be enhanced with some of the techniques introduced here. For example,
to overcome the radix limitation, currently fixed at $2^4$, this multiplier
could employ multiple processing engines per cell ($s, v \ne 1$), and to reduce
memory requirements it could use Booth recoding.



Table 3. ECP multiplier vs. Design 2 multiplier [10]

Characteristic             | ECP                                  | Design 2 (k = 4) [10]
Type                       | semi-systolic                        | systolic
Main Application           | Elliptic Curves                      | Exponentiation
Basic Operation            | $|ABR^{-1}|_{\widetilde{M}}$         | $|ABR^{-1}|_{\widetilde{M}}$ & $|ACR^{-1}|_{\widetilde{M}}$
Throughput (mult./#clks)   | $1/(\lceil m/k \rceil + 2d)$         | $2/(2\lceil m/k \rceil)$
Latency (#clks)            | $< \lceil m/k \rceil + 2d$           | $2\lceil m/k \rceil$
Accuracy                   |                                      | $2(2^k)M$
Max. Radix                 | $2^{rs}$                             | $2^4$
#LUT                       | $(2 + 4(2s + v))(m + k(d + 1))$      | $12m$
#Flip-Flops                | $(2 + 2s + v)(m + k(d + 1))$         | $12m$
Frequency (MHz)            | 40                                   | 48
FPGA                       | XCV1000E-8-BG680                     | XC4000



9 Conclusions 

This work proposed a new elliptic curve processor architecture for the
computation of point multiplication for curves defined over fields GF(p). This
processor uses a new type of high-radix Montgomery multiplier that relies on
the precomputation of frequently used values and on the use of multiple
processing engines.

The ECP's architectural scalability was verified with prototype
implementations suitable for the implementation of secure elliptic curve
cryptosystems (192 and 521 bits). Our estimates indicate that if it were
possible to extract 100% throughput from our multiplier, a point multiplication
on a curve defined over $GF(2^{192} - 2^{64} - 1)$ could be computed in about
3 msecs using the double-and-add algorithm and the projective coordinates
algorithms defined in [18].

A Elliptic Curve Point Multiplication



Algorithm 2: Double-and-add point multiplication using the projective
coordinates algorithms defined in [18]

double_and_add(x, y, k)
    (X, Y, Z) = conv_projective(x, y)
    (X0, Y0, Z0) = (X, Y, Z)                     /* P0 = P */
    for i = l - 2 down to 0 do
        (X, Y, Z) = double(X, Y, Z)              /* P = 2P */
        if k_i = 1 then                          /* P = P + P0 */
            (X, Y, Z) = add(X0, Y0, Z0, X, Y, Z)
        end if
    end for
    (x, y) = conv_affine(X, Y, Z)
    return (x, y)

double(X1, Y1, Z1)
    /* if P = O then return O */
    if (X1, Y1, Z1) = O then return(O)
    else                                         /* P != O: return 2P */
        M = 3 X1^2 + a Z1^4
        Z2 = 2 Y1 Z1
        S = 4 X1 Y1^2
        X2 = M^2 - 2S
        T = 8 Y1^4
        Y2 = M(S - X2) - T
    endif
    return(X2, Y2, Z2)

add(X0, Y0, Z0, X1, Y1, Z1)
    /* if P1 = O then return P0 */
    if (X1, Y1, Z1) = O then
        return(X0, Y0, Z0)
    /* else if P0 = -P1 then return O */
    else if (X0, Y0, Z0) = -(X1, Y1, Z1) then
        return(O)
    /* else if P0 = P1 then return 2 P0 */
    else if (X0, Y0, Z0) = (X1, Y1, Z1) then
        (X2, Y2, Z2) = double(X0, Y0, Z0)
    else                                         /* return P2 = P0 + P1 */
        U0 = X0 Z1^2
        S0 = Y0 Z1^3
        U1 = X1 Z0^2
        S1 = Y1 Z0^3
        W = U0 - U1
        R = S0 - S1
        T = U0 + U1
        M = S0 + S1
        Z2 = Z0 Z1 W
        X2 = R^2 - T W^2
        V = T W^2 - 2 X2
        2 Y2 = V R - M W^3
    endif
    return(X2, Y2, Z2)

conv_projective(x, y)
    return (X = x, Y = y, Z = 1)

conv_affine(X, Y, Z)
    return(x = X/Z^2, y = Y/Z^3)
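
The appendix routines can be exercised with a small Python sketch. It follows
the same projective formulas but runs over a toy field so that results are easy
to check; the curve $y^2 = x^3 + 2x + 3$ over GF(97) and the base point (3, 6)
are illustrative assumptions, not parameters from the paper. The check for
$P_0 = O$ in add and the use of Python's modular inverse are additions made for
this sketch.

```python
# Toy curve y^2 = x^3 + a*x + 3 over GF(p); illustrative parameters only.
P, A_COEF = 97, 2
INF = (1, 1, 0)          # any triple with Z = 0 stands for the point at infinity O


def double(X1, Y1, Z1):
    if Z1 % P == 0:
        return INF
    M = (3 * X1 * X1 + A_COEF * pow(Z1, 4, P)) % P
    Z2 = (2 * Y1 * Z1) % P
    S = (4 * X1 * Y1 * Y1) % P
    X2 = (M * M - 2 * S) % P
    T = (8 * pow(Y1, 4, P)) % P
    Y2 = (M * (S - X2) - T) % P
    return (X2, Y2, Z2)


def add(X0, Y0, Z0, X1, Y1, Z1):
    if Z1 % P == 0:
        return (X0, Y0, Z0)
    if Z0 % P == 0:                       # extra guard, not in the appendix
        return (X1, Y1, Z1)
    U0, S0 = X0 * Z1 * Z1 % P, Y0 * pow(Z1, 3, P) % P
    U1, S1 = X1 * Z0 * Z0 % P, Y1 * pow(Z0, 3, P) % P
    W, R = (U0 - U1) % P, (S0 - S1) % P
    if W == 0:                            # same x: either P0 = P1 or P0 = -P1
        return double(X0, Y0, Z0) if R == 0 else INF
    T, M = (U0 + U1) % P, (S0 + S1) % P
    Z2 = Z0 * Z1 * W % P
    X2 = (R * R - T * W * W) % P
    V = (T * W * W - 2 * X2) % P
    Y2 = (V * R - M * pow(W, 3, P)) * pow(2, -1, P) % P   # solve 2*Y2 = VR - MW^3
    return (X2, Y2, Z2)


def conv_affine(X, Y, Z):
    if Z % P == 0:
        return None                       # point at infinity
    zi = pow(Z, -1, P)
    return (X * zi * zi % P, Y * pow(zi, 3, P) % P)


def double_and_add(x, y, k):
    X, Y, Z = x, y, 1                     # conv_projective
    for i in range(k.bit_length() - 2, -1, -1):
        X, Y, Z = double(X, Y, Z)
        if (k >> i) & 1:
            X, Y, Z = add(x, y, 1, X, Y, Z)
    return conv_affine(X, Y, Z)


if __name__ == "__main__":
    print(double_and_add(3, 6, 5))
```

The comparisons hidden in the W == 0 and R == 0 tests are exactly the
field-element comparisons that Section 6.3 discusses for the hardware.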
Abstract. We proposed a fast parallel algorithm of Montgomery multiplication
based on Residue Number Systems (RNS). An implementation
of RSA cryptosystem using the RNS Montgomery multiplication is de- 
scribed in this paper. We discuss how to choose the base size of RNS 
and the number of parallel processing units. An implementation method 
using the Chinese Remainder Theorem (CRT) is also presented. An LSI 
prototype adopting the proposed Cox-Rower Architecture achieves 1024- 
bit RSA transactions in 4.2 msec without CRT and 2.4 msec with CRT, 
when the operating frequency is 80 MHz and the total number of logic 
gates is 333 KG for 11 parallel processing units. 

Keywords: RSA cryptography, residue number systems, Montgomery 
multiplication, modular exponentiation 



1 Introduction 

Computational performance of large integers is important in the implementa- 
tion of public key cryptography and digital signature. We proposed a fast par- 
allel Montgomery multiplication algorithm based on Residue Number Systems 
(RNS) [1]. In RNS, an integer is represented by a set of its residues in terms 
of base elements of RNS, and thus addition, subtraction, and multiplication 
can be independently carried out for every base element. On the other hand, 
Montgomery multiplication is a method for performing modular multiplication 
by substituting addition and multiplication for division. Therefore, the combi- 
nation of RNS and Montgomery multiplication is expected to be well suited to 
parallel processing of modular exponentiation, and several studies concerning it 
have been reported [2], [3], [4]. 

The main purpose of our previous paper [1] was to improve the base trans- 
formation algorithm which consumes most of the processing time in the RNS 
Montgomery multiplication. We also proposed a hardware “Cox-Rower Archi- 
tecture” suitable for the RNS Montgomery multiplication. The base transforma- 
tion operation is efficiently realized by the Cox-Rower Architecture where Rower 
units perform parallel processing in cooperation with one Cox unit. Based on
this architecture, a performance of 1 Mbps has been estimated for a 1024-bit
RSA cryptosystem at an operating frequency of 100 MHz.

In this paper, we investigate an implementation of RSA cryptosystem using 
the proposed RNS Montgomery multiplication algorithm, and design an RSA LSI 
to confirm the performance and feasibility of the proposed algorithm. As imple- 
mentation methods, RSA decryption procedures without and with the Chinese 
Remainder Theorem (CRT) are presented. The Cox-Rower Architecture is char- 
acterized by the scalability for operating time and chip size depending on the 
number of Rower units. In implementation, the relation between the number of 
Rower units and the base size in RNS representation becomes important for the 
performance, because operations for each base element are performed in parallel
at Rower units. For an LSI prototype using 0.25 µm CMOS, we obtain 4.2
msec for 1024-bit RSA cryptosystem without CRT and 2.4 msec with CRT. This 
result is comparable with the present best performance of commercial chips. 

The organization of the paper is as follows: In the next section, the RNS 
Montgomery multiplication algorithm proposed in Ref. [1] is surveyed. In Sec. 3, 
we present RSA decryption procedures without and with CRT, and discuss the 
base size of RNS and the number of parallel processing units from the viewpoint 
of implementation. In Sec. 4, for the designed LSI, a hardware structure and its 
specifications are described. Finally, a short summary is given in Sec. 5. 



2 Algorithm 

2.1 Residue Number Systems 

In RNS, an integer x is represented by

$\langle x \rangle_a = (x[a_1], x[a_2], \ldots, x[a_n])$,   (1)

where $x[a_i] = x \bmod a_i$. The set $a = \{a_1, a_2, \ldots, a_n\}$ is called
a base and the number of elements n is its base size. The elements are required
to satisfy $\gcd(a_i, a_j) = 1$ for $i \ne j$. CRT assures that the integer x
which satisfies $0 \le x < A$ ($A = \prod_{i=1}^{n} a_i$) is uniquely
represented by $\langle x \rangle_a$.

The RNS representation has an advantage in which addition, subtraction, 
and multiplication can be realized by modular addition, subtraction, and multi- 
plication for each RNS element as follows: 

± y)a = ((a^[«i] ± y[ai])[ai], ■■■, (a:[a„] ± y[a„])[a„]) , (2) 

{x ■ y)a = ((a:[ai] ■ y[ai])[ai], . . . , (x[a„] • i/[a„])[a„]) , (3) 

which enables parallel computation using n processing units. However, no
efficient method is known for performing comparison and division directly in the
RNS representation. To overcome this disadvantage, combination with Montgomery
multiplication has been proposed [1], [2], [3], [4].
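
The element-wise arithmetic of Eqs. (1)-(3) can be illustrated with a small
Python sketch. The function names and the toy base below are assumptions chosen
for illustration; they do not come from the paper, and the CRT reconstruction
is included only to check the results.

```python
from math import gcd, prod

def to_rns(x, base):
    """Residues x[a_i] = x mod a_i, as in Eq. (1)."""
    return [x % ai for ai in base]

def rns_add(xs, ys, base):
    """Element-wise modular addition, Eq. (2)."""
    return [(xi + yi) % ai for xi, yi, ai in zip(xs, ys, base)]

def rns_mul(xs, ys, base):
    """Element-wise modular multiplication, Eq. (3)."""
    return [(xi * yi) % ai for xi, yi, ai in zip(xs, ys, base)]

def from_rns(xs, base):
    """CRT reconstruction of the unique x with 0 <= x < A."""
    A = prod(base)
    total = 0
    for xi, ai in zip(xs, base):
        Ai = A // ai
        total += xi * pow(Ai, -1, ai) * Ai   # pow(..., -1, m) needs Python 3.8+
    return total % A

# Illustrative pairwise coprime base (an assumption, not the paper's base).
BASE = [13, 15, 16, 17]
assert all(gcd(p, q) == 1 for i, p in enumerate(BASE) for q in BASE[i + 1:])
```

Each residue channel is independent, which is what allows one processing unit
per base element. Results are only meaningful while they stay below A.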
2.2 Montgomery Multiplication 

Montgomery multiplication is known to be an efficient method for implementing
the modular exponentiation used in public-key cryptography. In the algorithm
shown below, the inputs are $x$, $y$, and $N$ ($x, y < N$) and the output is
$w \equiv xyR^{-1} \pmod{N}$ ($w < 2N$), where $\gcd(R, N) = 1$ and $N < R$.

1: $s \leftarrow x \cdot y$
2: $t \leftarrow s \cdot (-N^{-1}) \bmod R$
3: $u \leftarrow t \cdot N$
4: $v \leftarrow s + u$
5: $w \leftarrow v / R$

The Montgomery constant R is chosen so as to make the reduction modulo R in
step 2 and the division by R in step 5 simple. For example, R is generally set
to a power of 2 in a radix-2 representation.

It is characteristic of Montgomery multiplication to perform modular mul- 
tiplication by substituting addition and multiplication for division. Since the 
advantage of RNS is that addition, subtraction, and multiplication can be inde- 
pendently performed for each RNS element, the combination of RNS and Mont- 
gomery multiplication is expected to realize fast parallel processing effectively. 
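
As a minimal sketch of the five steps listed above, the following Python
function works on ordinary integers (not on RNS residues). The function name
and the toy parameters used to exercise it are assumptions made for
illustration.

```python
def montgomery_multiply(x, y, N, R):
    """Return w with w = x*y*R^{-1} (mod N) and w < 2N, for x, y < N < R.

    Requires gcd(R, N) = 1. The constant (-N^{-1}) mod R depends only on N
    and R, so a real implementation would precompute it.
    """
    n_neg_inv = (-pow(N, -1, R)) % R   # (-N^{-1}) mod R, Python 3.8+
    s = x * y                          # step 1
    t = (s * n_neg_inv) % R            # step 2
    u = t * N                          # step 3
    v = s + u                          # step 4: v is divisible by R
    return v // R                      # step 5: exact division
```

Because $t \equiv -sN^{-1} \pmod{R}$, the sum $v = s + tN$ is a multiple of R,
so the final division is exact. With R a power of 2, steps 2 and 5 reduce to
masking and shifting.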



2.3 RNS Montgomery Multiplication 

The RNS Montgomery multiplication algorithm proposed in Ref. [1] is briefly
described in this subsection. The above-mentioned Montgomery multiplication
procedure is rewritten by using RNS as shown in Fig. 1. Two bases a and b are
introduced, and $B$ ($= \prod_{i=1}^{n} b_i$) is used as the Montgomery
constant. We assume here that both a and b have the base size n, and denote the
RNS representation of x based on a and b by $\langle x \rangle_{a \cup b}$, or
simply by $\langle x \rangle$. The bases a and b satisfy

$A, B \ge 8N$, $\gcd(A, B) = 1$, and $\gcd(B, N) = 1$.



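Function: $\langle w \rangle_{a \cup b} = \mathrm{MM}(\langle x \rangle_{a \cup b}, \langle y \rangle_{a \cup b}, N)$
Input: $\langle x \rangle_{a \cup b}$, $\langle y \rangle_{a \cup b}$ ($x, y < 2N$)
Output: $\langle w \rangle_{a \cup b}$ ($w \equiv xyB^{-1} \pmod{N}$, $w < 2N$)

Step | Base a operation | Base b operation
1 | $\langle s \rangle_a \leftarrow \langle x \cdot y \rangle_a$ | $\langle s \rangle_b \leftarrow \langle x \cdot y \rangle_b$
2 | - | $\langle t \rangle_b \leftarrow \langle s \cdot (-N^{-1}) \rangle_b$
3 | $\langle t \rangle_a \leftarrow \mathrm{BT}(\langle t \rangle_b, 0)$ | -
4 | $\langle u \rangle_a \leftarrow \langle t \cdot N \rangle_a$ | -
5 | $\langle v \rangle_a \leftarrow \langle s + u \rangle_a$ | -
6 | $\langle w \rangle_a \leftarrow \langle v \cdot B^{-1} \rangle_a$ | -
7 | - | $\langle w \rangle_b \leftarrow \mathrm{BT}(\langle w \rangle_a, 0.5)$

Fig. 1. RNS Montgomery multiplication algorithm.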
Steps 3 and 7 in Fig. 1 are base transformations (BT) between a and b (see
Fig. 2). According to CRT, x in radix representation is calculated from
$\langle x \rangle_a$ by

$$x = \Bigl( \sum_{i=1}^{n} \xi_i A_i \Bigr) \bmod A, \tag{4}$$

where $\xi_i = x[a_i] A_i^{-1}[a_i] \bmod a_i$ and $A_i = A/a_i$. Equation (4)
is rewritten as

$$x = \sum_{i=1}^{n} \xi_i A_i - kA \tag{5}$$

with an unknown parameter k. Dividing both sides of Eq. (5) by A, we obtain

$$k = \Bigl\lfloor \sum_{i=1}^{n} \frac{\xi_i}{a_i} \Bigr\rfloor \tag{6}$$

from $0 \le x/A < 1$ and $k \le \sum_{i=1}^{n} \xi_i/a_i < k + 1$. Figure 2
shows a procedure of the base transformation from a to b. In this procedure, k
is approximated by

$$\hat{k} = \Bigl\lfloor \sum_{i=1}^{n} \frac{\mathrm{trunc}(\xi_i)}{2^r} + \alpha \Bigr\rfloor, \tag{7}$$

where $\mathrm{trunc}(\xi_i)$ is a function to approximate $\xi_i$ by its most
significant $q$ ($< r$) bits: i.e.
$\mathrm{trunc}(\xi_i) = \xi_i \wedge (\underbrace{1 \ldots 1}_{q} \underbrace{0 \ldots 0}_{r-q})_{(2)}$,
$\wedge$ means a bitwise AND operation, and r is the bit length of processing
units. An offset value $\alpha$ is required as a correction caused by the
approximation, and is set to 0 at step 3 and 0.5 at step 7 in Fig. 1. The
parameter $\hat{k}$ is computed recursively by $k_i$ as shown at steps 4-6 in
Fig. 2, where $k_i$ satisfies $\hat{k} = \sum_{i=1}^{n} k_i$.



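Function: $\langle x \rangle_b = \mathrm{BT}(\langle x \rangle_a, \alpha)$
Input: $\langle x \rangle_a$, $\alpha = 0$ or $0.5$
Output: $\langle x \rangle_b$
Precomputation: $\langle A_i^{-1} \rangle_a$, $\langle A_i \rangle_b$ ($i = 1, \ldots, n$), $\langle -A \rangle_b$

1: $\xi_i = x[a_i] A_i^{-1}[a_i] \bmod a_i$
2: $\sigma_0 = \alpha$, $y_{i,0} = 0$
3: For $j = 1, \ldots, n$
4:   $\sigma_j = \sigma_{j-1} + \mathrm{trunc}(\xi_{(i+1-j)})/2^r$
5:   $k_j = \lfloor \sigma_j \rfloor$
6:   $\sigma_j = \sigma_j - k_j$
7:   $y_{i,j} = y_{i,j-1} + \xi_{(i+1-j)} \cdot A_{(i+1-j)}[b_i] + k_j \cdot (-A)[b_i]$
8: Next j
9: $x[b_i] = y_{i,n} \bmod b_i$

Fig. 2. Base transformation algorithm.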
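
The base transformation can be sketched in Python as follows. This is a
sequential model under stated assumptions, not the hardware schedule: the
function name, the use of `Fraction` for the accumulator, and the small 8-bit
toy bases in the usage note are all illustrative choices, and the loop visits
the $\xi$ values in plain index order instead of the ring order used by the
Rower units.

```python
from fractions import Fraction
from math import floor, prod

def base_transform(x_a, a, b, alpha, r, q):
    """Convert residues of x in base a to residues in base b (cf. Fig. 2).

    alpha is the offset (0 or 0.5), r the word length of the moduli, and q
    the number of leading bits kept by trunc(). Every a_i must be < 2**r.
    """
    A = prod(a)
    A_i = [A // ai for ai in a]
    # Step 1: xi_i = x[a_i] * A_i^{-1}[a_i] mod a_i
    xi = [(x_a[i] * pow(A_i[i], -1, a[i])) % a[i] for i in range(len(a))]
    mask = ((1 << q) - 1) << (r - q)          # keeps the q most significant bits
    sigma = Fraction(alpha)                   # step 2
    y = [0] * len(b)
    for i in range(len(a)):                   # steps 3-8
        sigma += Fraction(xi[i] & mask, 1 << r)
        k_i = floor(sigma)                    # 0 or 1: needs only an adder carry
        sigma -= k_i
        for m, bm in enumerate(b):
            y[m] = (y[m] + xi[i] * (A_i[i] % bm) + k_i * ((-A) % bm)) % bm
    return y                                  # step 9
```

With $\alpha = 0.5$ the result is exact as long as x is small enough relative
to A and the accumulated truncation error stays below the offset. For the toy
bases `a = [251, 253, 255]`, `b = [247, 241, 239]` with `r = 8` and `q = 7`,
the error is below 21/256, so every x below A/2 converts exactly.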
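Function: $\langle y \rangle_{a \cup b} = \mathrm{MEXP}(\langle x \rangle_{a \cup b}, d, N)$
Input: $\langle x \rangle_{a \cup b}$, $d = (d_k, \ldots, d_1)_{(2^4)}$
Output: $\langle y \rangle_{a \cup b}$, s.t. $y = x^d B^{-(d-1)} \bmod N$
Precomputation: $\langle B_N \rangle_{a \cup b}$, s.t. $B_N = B \bmod N$

1: $\langle x^0 \rangle \leftarrow \langle B_N \rangle$
2: $\langle x^1 \rangle \leftarrow \langle x \rangle$
3: $\langle x^{i+1} \rangle \leftarrow \mathrm{MM}(\langle x^i \rangle, \langle x \rangle, N)$ (for $i = 1, \ldots, 14$)
4: $\langle y \rangle \leftarrow \langle x^{d_k} \rangle$
5: For $i = k - 1, \ldots, 1$
6:   For $j = 1, \ldots, 4$
7:     $\langle y \rangle \leftarrow \mathrm{MM}(\langle y \rangle, \langle y \rangle, N)$
8:   Next j
9:   $\langle y \rangle \leftarrow \mathrm{MM}(\langle y \rangle, \langle x^{d_i} \rangle, N)$
10: Next i

Fig. 3. RNS modular exponentiation algorithm.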

As compared with the previous RNS Montgomery multiplication algorithm 
[2], the above-mentioned algorithm has an advantage in that the base transfor- 
mation at step 7 in Fig. 1 is error-free and does not need extra steps for error 
correction. Moreover, the correction factor $k_i$ is computed only by an adder as
will be described in Sec. 4.1, which can make the hardware structure simpler 
than that in Ref. [2]. 

An exponentiation algorithm based on a 4-bit window method is realized by
the RNS Montgomery multiplication as shown in Fig. 3. It is assumed that the
input variable has previously been transformed into $x' = xB \bmod N$, because
Montgomery multiplication introduces the Montgomery constant B as an essential
feature. Under this assumption, we obtain $y = x^d B \bmod N$ as an output.
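
The window method of Fig. 3 can be sketched on ordinary integers as below. The
helper `mm` stands in for the RNS Montgomery multiplication and is an
assumption of this sketch, as are the function names and the toy modulus in
the usage note. Intermediate values are kept below 2N, which requires
$B \ge 4N$ here.

```python
def mm(x, y, N, B):
    """Montgomery product x*y*B^{-1} mod N, kept below 2N for x, y < 2N."""
    t = (x * y * (-pow(N, -1, B))) % B
    return (x * y + t * N) // B

def mexp(x_mont, d, N, B):
    """4-bit window exponentiation on a Montgomery-form input (cf. Fig. 3).

    x_mont = x*B mod N. Returns a value congruent to x^d * B (mod N), d >= 1.
    """
    table = [B % N, x_mont]                    # steps 1-2: entries for x^0, x^1
    for _ in range(14):                        # step 3: entries for x^2 .. x^15
        table.append(mm(table[-1], x_mont, N, B))
    digits = []                                # radix-16 digits of d, low first
    while d:
        digits.append(d & 0xF)
        d >>= 4
    y = table[digits[-1]]                      # step 4: most significant digit
    for digit in reversed(digits[:-1]):        # steps 5-10
        for _ in range(4):
            y = mm(y, y, N, B)                 # four squarings per digit
        y = mm(y, table[digit], N, B)
    return y
```

For example, with `N = 3233` and `B = 2**16`, calling
`mexp(65 * B % N, 2753, N, B)` gives a value congruent to $65^{2753} B$
modulo N, and one further `mm(y, 1, N, B)` removes the factor B.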

The number of clock cycles needed for one RNS Montgomery multiplication is
$O(n^2/u)$, where u is the number of parallel processing units. The RNS modular
exponentiation (MEXP) is realized as an iteration of the RNS Montgomery
multiplication (MM), and the number of iterations is proportional to the key
length $|N|$. Since $n \propto |N|$, the performance of the RNS modular
exponentiation is consequently estimated as $O(n^3/u)$. This relation means that
performance and chip size scale with the number of parallel processing units u,
since the chip size is determined by u.

3 Implementation 

Figure 4 shows an RSA decryption procedure using the RNS modular exponentiation
algorithm. In steps 1 and 2, modular arithmetic based on the radix-2
representation is required. We assume that this modular arithmetic is performed
at a dedicated divider unit. In step 3, $\langle -N^{-1} \rangle_b$, which is
used in MM(), is calculated from $N^{-1}[b_i] = N^{\lambda_i - 1} \bmod b_i$,
where $\lambda_i$, the Carmichael function [5] of $b_i$, is precomputed and
stored in ROMs. Since steps 1-4 (except for C at step 4) depend only on the key
N, it is effective to precompute these steps if N is not changed frequently. As
mentioned above, it is assumed in MEXP() that the input and the output are
variables multiplied by the Montgomery constant B. From this condition, steps 5
and 7 are required as a transformation to get $C' = CB \bmod N$ and as its
inverse transformation, respectively. Finally, step 9 is a correction to assure
$m < N$, because m obtained in step 7 is less than 2N.

Input: $C$, $d$, $N$
Output: $m = C^d \bmod N$
1: $B_N \leftarrow B \bmod N$
2: $B_N^2 \leftarrow B^2 \bmod N$
3: Compute $\langle -N^{-1} \rangle_b$
4: Radix-RNS conversion: $N$, $B_N$, $B_N^2$, $C$
5: $\langle C' \rangle \leftarrow \mathrm{MM}(\langle C \rangle, \langle B_N^2 \rangle, N)$
6: $\langle m' \rangle \leftarrow \mathrm{MEXP}(\langle C' \rangle, d, N)$
7: $\langle m \rangle \leftarrow \mathrm{MM}(\langle m' \rangle, \langle 1 \rangle, N)$
8: RNS-Radix conversion: $\langle m \rangle_a$
9: $m \leftarrow m - cN$ ($c = 0$ or $1$)

Fig. 4. RSA decryption algorithm.

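Radix-RNS and RNS-Radix conversions are defined by

$$x[a_i] = \Bigl( \sum_{j=0}^{n-1} x(j) \cdot 2^{rj}[a_i] \Bigr) \bmod a_i, \tag{8}$$

$$\begin{pmatrix} x(n-1) \\ \vdots \\ x(1) \\ x(0) \end{pmatrix}
= \sum_{i=1}^{n} \left\{ \xi_i \begin{pmatrix} A_i(n-1) \\ \vdots \\ A_i(1) \\ A_i(0) \end{pmatrix}
- k_i \begin{pmatrix} A(n-1) \\ \vdots \\ A(1) \\ A(0) \end{pmatrix} \right\}, \tag{9}$$

where the notation $x(i)$ means the $i$-th digit of the radix-$2^r$
representation of x: i.e. $x = \sum_{i=0}^{n-1} 2^{ri} x(i)$.
In the RNS-Radix conversion, carry propagation is needed after the summation
has been finished.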
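
The Radix-RNS direction can be sketched as a digit-wise multiply-accumulate
per base element. The function name and the toy base are assumptions for
illustration; in hardware the constants $2^{rj} \bmod a_i$ would be read from
ROM instead of being recomputed.

```python
def radix_to_rns(x, base, r):
    """Residues of x for each modulus in base, from its radix-2^r digits.

    Assumes 0 <= x < 2**(r * len(base)), so that len(base) digits suffice.
    """
    n = len(base)
    word_mask = (1 << r) - 1
    digits = [(x >> (r * j)) & word_mask for j in range(n)]   # x(0) .. x(n-1)
    residues = []
    for ai in base:
        acc = 0
        for j, xj in enumerate(digits):
            acc += xj * pow(2, r * j, ai)     # x(j) * (2^{rj} mod a_i)
        residues.append(acc % ai)
    return residues
```

Each base element needs only word-sized multiplications and one final
reduction, so the conversion maps onto the same multiplier-and-accumulator
that performs the Montgomery multiplication.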



3.1 Base Size and Number of Parallel Units 

The operation shown in Fig. 4, except for steps 1 and 2 performed at the
additional divider unit and the carry propagation in the RNS-Radix conversion,
is independently carried out for every base element $a_i$ and $b_i$ at parallel
processing units. Here, the base size n satisfies
$n \ge \lceil (|N| + r)/r \rceil$, where $|N|$ is the bit length of the modulus
N. The number of parallel processing units u can be chosen in the range
$1 \le u \le n$. If $u < n$, time-sharing processing for some base elements is
performed in each unit. When the base size n is fixed, RSA transaction
performance improves in proportion to u. Obviously, it is efficient to set u as
a divisor of n in order to control all processing units by the same procedure.
By choosing u appropriately, a variety of chips can be realized in terms of
performance and size.

In the implementation, it is realistic that all parameters which depend only
on the base sets a and b are precomputed and stored in ROMs. A chip loaded
with base sets for an RSA key length L can deal with key lengths which are
shorter than L. However, the processing time for a key length $l$ ($< L$) is
reduced only to $l/L$ of that for the key length L, although $(l/L)^3$ would be
achieved ideally. This overhead arises because the cost of an RNS Montgomery
multiplication is determined by the base size, so the number of operations per
multiplication does not decrease for shorter keys.

In order to perform an efficient computation for shorter key lengths, it is 
necessary to prepare some base sets for typical key lengths: e.g. 512, 1024, and 
2048 bits. For these key lengths, minimum base sizes become 17, 33, and 65 
in the case of r = 32. There are several implementation methods to deal with 
different-sized base sets. Among them, it is advantageous to set a base size to a 
multiple of u from the viewpoint of the simplicity of a control circuit. Therefore, 
if u = 11, appropriate base sizes are 22, 33, and 66 for key lengths 512, 1024, 
and 2048 bits, respectively. In this case, it is expected that 1024-bit and 2048-bit 
RSA processing has good performance, whereas 512-bit processing has overhead 
time. 
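
The base sizes quoted above follow from the bound
$n \ge \lceil (|N| + r)/r \rceil$ and from rounding up to a multiple of u. A
short sketch, with function names that are assumptions of this example:

```python
def min_base_size(key_bits, r=32):
    """Smallest n with n >= ceil((|N| + r) / r), using integer arithmetic."""
    return -(-(key_bits + r) // r)

def chip_base_size(key_bits, u, r=32):
    """Smallest multiple of u that is at least the minimum base size."""
    n_min = min_base_size(key_bits, r)
    return -(-n_min // u) * u

for bits in (512, 1024, 2048):
    print(bits, min_base_size(bits), chip_base_size(bits, u=11))
```

For 512-bit keys the gap between the minimum size and the size actually used
is the largest of the three, which is the source of the overhead mentioned for
512-bit processing.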

3.2 CRT Mode 

An RSA decryption procedure with CRT is given by

$$\begin{aligned}
m &= (C^{d_p} \bmod p)(q^{-1} \bmod p)\, q + (C^{d_q} \bmod q)(p^{-1} \bmod q)\, p \pmod{N} \\
  &= [(C^{d_p} \bmod p)(q^{-1} \bmod p) \bmod p]\, q
   + [(C^{d_q} \bmod q)(p^{-1} \bmod q) \bmod q]\, p \pmod{N},
\end{aligned} \tag{10}$$

where $N = pq$, $d_p = d \bmod (p - 1)$, and $d_q = d \bmod (q - 1)$. A
procedure to perform Eq. (10) is shown in Fig. 5. The operations for p and q
are carried out sequentially. Precomputations of steps 1, 2, 4, and 5 (except
for $C_p$ and $C_q$) are effective if the secret keys p and q are not changed
frequently. Here, it should be noted that we need the RNS representation of m
by means of the base $a \cup b$ in the RNS-Radix conversion at step 11, because
the modulus N in RSA processing with CRT is represented uniquely not by a
single base a or b but by the base $a \cup b$. In contrast, only
$\langle m \rangle_a$ (or $\langle m \rangle_b$) is sufficient for the
RNS-Radix conversion in Fig. 4. Step 12 is a correction to assure $m < N$, the
same as step 9 in Fig. 4. In this case, m obtained in step 10 is less than 4N,
because $u_p$ and $u_q$ ($< 2N$) are added to each other without a mod N
operation.

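The processing time of Fig. 5 is dominated by MEXP() at step 7. Since
the base size n can be reduced to n/2 by adopting CRT, the processing time
of MEXP() becomes 1/8 of that in Fig. 4. As a result, reduction of about 1/4
is achieved in total processing time, because the operations for p and q are
performed sequentially. The same reduction ratio is obtained in a general case
based on the radix-2 representation.

Input: $C$, $d_p$, $d_q$, $N$, $p$, $q$, $q_{inv}$ ($= q^{-1} \bmod p$), $p_{inv}$ ($= p^{-1} \bmod q$)
Output: $m = C^d \bmod N$

Step | Operation for p | Operation for q
1 | $B_p \leftarrow B \bmod p$ | $B_q \leftarrow B \bmod q$
2 | $B_p^2 \leftarrow B^2 \bmod p$ | $B_q^2 \leftarrow B^2 \bmod q$
3 | $C_p \leftarrow C \bmod p$ | $C_q \leftarrow C \bmod q$
4 | Compute $\langle -p^{-1} \rangle_b$ | Compute $\langle -q^{-1} \rangle_b$
5 | Radix-RNS: $p$, $B_p$, $B_p^2$, $C_p$ | Radix-RNS: $q$, $B_q$, $B_q^2$, $C_q$
6 | $\langle C'_p \rangle \leftarrow \mathrm{MM}(\langle C_p \rangle, \langle B_p^2 \rangle, p)$ | $\langle C'_q \rangle \leftarrow \mathrm{MM}(\langle C_q \rangle, \langle B_q^2 \rangle, q)$
7 | $\langle m'_p \rangle \leftarrow \mathrm{MEXP}(\langle C'_p \rangle, d_p, p)$ | $\langle m'_q \rangle \leftarrow \mathrm{MEXP}(\langle C'_q \rangle, d_q, q)$
8 | $\langle t_p \rangle \leftarrow \mathrm{MM}(\langle m'_p \rangle, \langle q_{inv} \rangle, p)$ | $\langle t_q \rangle \leftarrow \mathrm{MM}(\langle m'_q \rangle, \langle p_{inv} \rangle, q)$
9 | $\langle u_p \rangle \leftarrow \mathrm{MUL}(\langle t_p \rangle, \langle q \rangle)$ | $\langle u_q \rangle \leftarrow \mathrm{MUL}(\langle t_q \rangle, \langle p \rangle)$
10 | $\langle m \rangle \leftarrow \mathrm{ADD}(\langle u_p \rangle, \langle u_q \rangle)$ (both columns)
11 | RNS-Radix conversion: $\langle m \rangle_{a \cup b}$ (both columns)
12 | $m \leftarrow m - cN$ ($c = 0, 1, 2,$ or $3$) (both columns)

Fig. 5. RSA decryption algorithm with CRT.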
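
The recombination of Eq. (10) can be checked with a small sketch on ordinary
integers. It uses Python's built-in `pow` in place of the RNS exponentiation,
fully reduces $t_p$ and $t_q$ (so the sum stays below 2N instead of 4N), and
uses a textbook toy key; the function name and all of these choices are
assumptions made for illustration.

```python
def rsa_crt_decrypt(C, d, p, q):
    """RSA decryption via Eq. (10), with a final correction by multiples of N."""
    N = p * q
    d_p, d_q = d % (p - 1), d % (q - 1)
    q_inv = pow(q, -1, p)                 # q^{-1} mod p (Python 3.8+)
    p_inv = pow(p, -1, q)                 # p^{-1} mod q
    m_p = pow(C % p, d_p, p)              # exponentiation for p
    m_q = pow(C % q, d_q, q)              # exponentiation for q
    t_p = (m_p * q_inv) % p
    t_q = (m_q * p_inv) % q
    m = t_p * q + t_q * p                 # no reduction mod N yet
    while m >= N:                         # correction: m <- m - cN
        m -= N
    return m

# Toy key: p = 61, q = 53, e = 17, d = 2753 (illustrative only, not secure).
ciphertext = pow(65, 17, 61 * 53)
print(rsa_crt_decrypt(ciphertext, 2753, 61, 53))
```

The two exponentiations use half-length moduli and half-length exponents,
which is where the ideal factor of 1/4 in total time comes from.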

4 Prototype 

We prototyped an LSI adopting the Cox-Rower Architecture. In this section, an 
outline of the LSI is described. 

4.1 Architecture 

The Cox-Rower Architecture was proposed as hardware suitable for the RNS
Montgomery multiplication [1]. The name is derived from its original structure,
in which multiple “Rower” units perform parallel processing in cooperation with
one “Cox” unit, which computes a correction factor in the base transformation.

A hardware structure of the Cox-Rower Architecture in this work is shown in
Fig. 6. It consists of u Rower units, each of which has a
multiplier-and-accumulator with a modular reduction unit for the base elements
$a_i$ and $b_i$. Figure 6 differs from the original structure proposed in
Ref. [1] in the following two points:

(i) Rower units are connected by ring connection instead of by bus connection. 

(ii) A Cox unit is embedded in every Rower unit. 

In the base transformation, $\xi_i$ ($i = 1, \ldots, n$), which has been
computed in each Rower unit, needs to be transferred to the other Rower units.
The original architecture uses bus connection for this transfer. We have found
that ring connection can also realize the transfer of the $\xi_i$'s by sending
them to an adjoining Rower unit in turn. In addition, since the original
architecture has only one Cox unit, it also needs bus connection to broadcast
the correction factor $k_i$, which is computed in the Cox unit, to all Rower
units. This broadcast is, however, avoidable by embedding a Cox unit in each
Rower unit as shown in Fig. 6, which further enables us to control all Rower
units by the same procedure. Consequently, we have adopted the ring connection
in this work to lower the data driving load and improve the modularity of Rower
units.

Fig. 6. Cox-Rower Architecture.

Fig. 7. Multiplier-and-accumulator (a) and Cox unit (b), where r = 32, 2 = 72, and
q = 9.

Structures of the multiplier-and-accumulator and the Cox unit are shown in
Fig. 7. The multiplier-and-accumulator has two stages: one accumulates the
result of multiplication-and-addition, and the other performs modular reduction
by the base elements. The Cox unit consists of a truncation unit, a q-bit
adder, and a register in order to compute $k_i$ in the base transformation
according to steps 4-6 in Fig. 2. One of the advantages of the proposed RNS
Montgomery multiplication algorithm lies in this simple structure of the Cox
unit.

Fig. 8. Modular reduction unit, where h = 10.

Here, let us consider the base transformation procedure at Rower unit i.
First, at step 1 in Fig. 2, $\xi_i$ is calculated from $x[a_i]$ and
$A_i^{-1}[a_i]$ and is stored in the register R2. Next, in the loop for
$j = 1$, $k_1$ is computed from $\xi_i$ at the Cox unit, and then $y_{i,1}$ is
obtained and stored in the register R1. Before the next loop for $j = 2$,
$\xi_{i-1}$, which has been computed at Rower unit $i - 1$, is transferred by
ring connection and is stored in R2. Then, the loop for $j = 2$ is carried out
based on $\xi_{i-1}$, and $y_{i,2}$ is obtained. After all loop processes for
$j = 1, \ldots, n$ have been finished, $x[b_i]$ is computed from $y_{i,n}$ at
step 9 and is stored in R2.

Figure 8 shows a structure of the modular reduction unit in Fig. 7(a). The
base elements $a_i$ and $b_i$ can be given as $2^r - \mu_i$ and $2^r - \nu_i$.
Small integers $\mu_i$ and $\nu_i$ ($\ll 2^r$) are chosen so as to make the
base elements coprime. In this case, modular computation for $a_i$ and $b_i$
can be realized by multipliers and adders, where the multipliers perform
multiplication by $\mu_i$ and $\nu_i$ as shown in Fig. 8. The maximum bit
length h of $\mu_i$ and $\nu_i$ is 10 bits for the base size $n = 66$. With
respect to the output y of the operation $x \bmod a_i$, the modular reduction
unit has the condition $y < 2^r$ instead of $y < a_i$, i.e. if
$x \bmod a_i < \mu_i$, the output can be $y = x \bmod a_i + a_i$.
We have ascertained that this condition does not affect the RNS Montgomery
multiplication algorithm.
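
Reduction modulo $2^r - \mu$ by multiplication and addition can be sketched as
repeated folding of the high part, using $2^r \equiv \mu \pmod{2^r - \mu}$.
The function name and the loop form are assumptions of this sketch; the actual
unit in Fig. 8 is a fixed datapath, not a loop.

```python
def reduce_mod_2r_minus_mu(x, r, mu):
    """Return y with y = x (mod 2**r - mu) and y < 2**r.

    As in the text, y is only guaranteed to fit in r bits; it may still
    exceed the modulus by less than mu. Requires 0 < mu < 2**r.
    """
    word_mask = (1 << r) - 1
    while x >> r:                              # something above bit r-1 remains
        x = (x >> r) * mu + (x & word_mask)    # hi * 2^r + lo  ==  hi * mu + lo
    return x
```

Each folding step strictly decreases x, and for a small $\mu$ only a few
steps are needed even for a full-width accumulator value.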

The Cox-Rower Architecture additionally has a divider which performs divi- 
sion based on the radix 2 representation. This divider is required for steps 1 and 
2 in Fig. 4 and steps 1-3 in Fig. 5. 
As described above, the Cox-Rower Architecture is designed to be suitable 
for the RNS Montgomery multiplication algorithm, particularly for the base 
transformation algorithm. The other operations such as the Radix-RNS and the 
RNS-Radix conversions can also be implemented efficiently in this architecture. 



4.2 Specifications 

The specifications of the LSI prototype are summarized in Table 1. In the LSI,
the standard length of 32 bits is adopted as the bit length r of the processing
units. The number of Rower units u is set to 11 in consideration of chip size.
We can use the base sizes of 22, 33, and 66, which realize maximum key lengths
of 672, 1024, and 2080 bits, respectively. Therefore, key lengths up to 2048
bits without CRT and 4096 bits with CRT are available.

SHA-1, which is required as a hash function in digital signatures, is
additionally implemented in the LSI. The SHA-1 core has an MGF1 function used
in the RSA standard PKCS#1 Ver.2.0 [6].

In Rower unit i ($i = 1, \ldots, 11$), operations for the base elements $a_j$
and $b_j$ are performed, where $j = i + 11\ell$ ($\ell = 0, \ldots, 5$). Thus,
parameters in terms of $a_j$ and $b_j$ are stored in the ROM of Rower unit i.
Table 2 lists the parameters. These parameters for the three base sizes
n = 22, 33, and 66 are prepared in the LSI, which needs the memory size shown
in Table 3. In this table, memory sizes in the case of shorter maximum key
lengths are also estimated. Since the LSI has been designed to provide long key
lengths such as 2048 and 4096 bits, the increase in memory size causes a large
core size. It is possible to reduce the memory size depending on maximum key
lengths as shown in Table 3.

Table 1. Specifications.

Process             | 0.25 μm CMOS
Operating frequency | 80 MHz
Operating voltage   | 2.5 V
Functions           | RSA without and with CRT
                    | SHA-1 hash code generation
                    | MGF1 (PKCS#1 Ver.2.0)
Performance         | 1024-bit RSA: 4.2 ms / 2.4 ms
(without/with CRT)  | 2048-bit RSA: 29.2 ms / 8.9 ms
                    | 4096-bit RSA: - / 60.4 ms
Core size           | 6.9 mm x 6.9 mm
No. of Rower units  | 11
No. of logic gates  | 333 KG (Total)
                    | 221 KG (RSA core)
                    | 36 KG (Divider)
                    | 57 KG (SHA-1)
                    | 19 KG (I/O etc.)

Table 2. Parameters stored in ROM.

Base transformation
  $\langle x \rangle_a \Rightarrow \langle x \rangle_b$: $A_j^{-1}[a_j]$; $A_1[b_j], \ldots, A_n[b_j]$, $-A[b_j]$
  $\langle x \rangle_b \Rightarrow \langle x \rangle_a$: $B_j^{-1}[b_j]$; $B_1[a_j], \ldots, B_n[a_j]$, $-B[a_j]$
Radix-RNS conversion
  $x \Rightarrow \langle x \rangle_b$: $2^{r}[b_j], 2^{2r}[b_j], \ldots, 2^{(n-1)r}[b_j]$
RNS-Radix conversion
  $\langle x \rangle_a \Rightarrow x$: $A_1(j-1), \ldots, A_n(j-1)$, $-A(j-1)$
  $\langle x \rangle_{a \cup b} \Rightarrow x$ (CRT mode): $(AB/a_j)^{-1}[a_j]$,
    $(AB/a_1)(j-1), \ldots, (AB/a_n)(j-1)$,
    $(AB/b_1)(j-1), \ldots, (AB/b_n)(j-1)$, $-AB(j-1)$

Table 3. Memory size.

Maximum RSA key length | ROM (KByte) | RAM (KByte)
2048 & 4096 (CRT) *    | 209         | 24
2048 & 2048 (CRT)      | 138         | 20
1024 & 2048 (CRT)      | 57          | 12

* Designed LSI

Fig. 9. Performance.

Figure 9 shows the details of the processing time. The transactions of I/O
and precomputation for keys are negligible in the total processing time and
are not exhibited in this figure. It is found that the contribution from
division based on the radix-2 representation becomes large in the processing of
shorter key lengths and in CRT mode. The latter arises because the number of
parameters which are computed in the divider increases in CRT mode. In a
comparison between the performance without and with CRT, a reduction ratio of
0.3 is obtained in 2048-bit RSA processing, which is close to the ideal ratio
of 1/4. However, the reduction ratio becomes 0.5 in 1024-bit processing. This
increase in reduction ratio is due to the use of a redundant base size in CRT
mode. In non-CRT mode of 1024-bit processing, we use the base size n = 33,
which is optimum for this key length. In contrast, the base size n = 22 used in
CRT mode is larger than necessary for the key length of 512 bits. These results
indicate that base sizes strongly affect the performance of the Cox-Rower
Architecture, as discussed in Sec. 3.1.
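
The reduction ratios can be recomputed from the Table 1 timings and compared
with a rough scaling estimate. The estimate below is a simplification assumed
for this sketch, not a model given in the paper: it takes the time of one
exponentiation as proportional to $n^2$ times the exponent length, ignores the
divider and conversions, and assumes that 2048-bit CRT mode uses the base size
33.

```python
def crt_reduction_ratios():
    """Measured and estimated (with CRT) / (without CRT) time ratios."""
    timings_ms = {1024: (4.2, 2.4), 2048: (29.2, 8.9)}   # Table 1 values
    base_sizes = {1024: (33, 22), 2048: (66, 33)}        # (non-CRT n, CRT n)
    ratios = {}
    for bits, (t_plain, t_crt) in timings_ms.items():
        n_plain, n_crt = base_sizes[bits]
        measured = t_crt / t_plain
        # Two exponentiations with half-length exponents and base size n_crt.
        estimate = 2 * 0.5 * (n_crt / n_plain) ** 2
        ratios[bits] = (measured, estimate)
    return ratios

for bits, (measured, estimate) in crt_reduction_ratios().items():
    print(f"{bits}-bit: measured {measured:.2f}, estimate {estimate:.2f}")
```

Under these assumptions the estimate is 1/4 when the base size halves (2048
bits) but only about 0.44 when it drops from 33 to 22 (1024 bits), which is
consistent with the explanation that the redundant base size limits the gain.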

At present, the best performance of 1024-bit RSA processing in commercial
chips is, as far as we know, reported for Rainbow’s chip (5 msec with CRT) [7]
and Pijnenburg’s chip (3 msec and 1.5 msec with CRT) [8], which is comparable
with the performance in this work. The Cox-Rower Architecture can be equipped
with up to 33 Rower units for 1024-bit RSA processing. In that case, a
threefold speedup can be realized and a processing time of less than 1 msec
becomes feasible.

5 Conclusions 

This paper presented an implementation of the RSA algorithm based on the RNS
Montgomery multiplication. We showed RSA decryption procedures and discussed
the relation between the base size of RNS and the number of parallel
processing units. The designed LSI adopting the Cox-Rower Architecture can
deal with key lengths up to 4096 bits in CRT mode. Using 11 Rower units, we
obtained 1024-bit RSA transactions in 4.2 msec without CRT and 2.4 msec with
CRT, at the operating frequency of 80 MHz. This result suggests that high
performance is attainable with this architecture. Downsizing of chips and
speedup by using more Rower units are subjects to be tackled in the next phase
of this work.
Abstract. We propose several new methods to protect the scalar multiplication
on an elliptic curve against Differential Analysis. The basic idea consists in
transforming the curve through various random morphisms to provide a
non-deterministic execution of the algorithm.

The solutions we suggest complement and improve the state-of-the-art, 
but also provide a practical toolbox of efficient countermeasures. These 
should suit most of the needs for protecting implementations of crypto- 
algorithms based on elliptic curves. 

Keywords. Public-key cryptography, Side-channel attacks, Differential
power analysis (DPA), Timing attacks, Elliptic curves, Smart-cards.



1 Introduction 

Since the introduction of the timing attacks [10] by Paul Kocher in 1996 and 
subsequently of the Differential Power Analysis (DPA) [9], the so-called side- 
channel attacks have become a major threat against tamper-resistant devices 
like smart-cards, to the point where the immediate relevance of classical security 
notions is somewhat questionable. Furthermore, numerous experiments show 
that most of the time, perfunctory countermeasures do not suffice to thwart 
those attacks. 

In the case of public-key cryptosystems based on the discrete logarithm on 
elliptic curves, the running time does not really represent a bottleneck for smart- 
card applications, which are equipped with additional devices for fast computa- 
tion in finite fields. Therefore, investigating the security of these applications, 
less constrained by performance criteria, against side-channel attacks is very 
relevant. 

Compared to the previous works of [5] and [7], this paper systematically
develops the same idea: assuming that an elliptic curve cryptosystem executes
some operations in the group of a curve E, the whole algorithm is transposed
to a curve $\phi(E)$, where $\phi$ is a random morphism. The rich algebraic
structure of elliptic curves enables numerous possible choices for such
morphisms.

The rest of this paper is organized as follows. In the next section, we provide 
a brief description of elliptic curves. We refer the reader to Appendix A for fur- 
ther mathematical details. The general principles of differential analysis and how 
this can reveal the secret keys of an elliptic curve cryptosystem are explained in 
Section 3. Next, we provide two main classes of possible morphisms to
randomize the basepoint. Finally, we present in Section 5 a new randomization
of the encoding of the multiplier in the case of Anomalous Binary Curves (ABC).

2 Elliptic Curves 

Let K be a field. An elliptic curve over K is a pair $(E, O)$ where E is a
non-singular curve of genus one over K with a point $O \in E$. It is well known
that the set of points $(x, y) \in K \times K$ verifying the (non-singular)
Weierstraß equation

$$E_{/K} : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 \quad (a_i \in K) \tag{1}$$

together with O form an elliptic curve and that an elliptic curve can always be
expressed in such a form. The point O is called the point at infinity.

The set of points $(x, y)$ satisfying Eq. (1) and O form an Abelian group
where O is the neutral element. This group is denoted by $E(K)$ and the group
operation is denoted by $+$. The operation consisting in computing the multiple
of a point, $Q = kP := P + \cdots + P$ (k times), is called (elliptic curve)
scalar multiplication. We refer the unfamiliar reader to Appendix A for the
required background on this particular topic.

3 Differential Analysis 

In his CRYPTO ’96 paper [10] and thereafter in [9] with Jaffe and Jun, Kocher
launched a new class of attacks, the so-called side-channel attacks.

The basic idea of side-channel attacks is that some side-channel information
(e.g., timing, power consumption, electromagnetic radiation) of a device
depends on the operations it performs. For instance, it is well known that the
modification of a memory state yields a different power consumption depending
on whether the memory goes from one to zero or the opposite.

By capturing this information, it may be possible to recover some secret keys 
involved during the execution of a crypto-algorithm, at least in a careless imple- 
mentation. When a single input is used in eliciting information, the process is 
referred to as a Simple Analysis and when there are several inputs used together 
with statistical tools, it is referred to as a Differential Analysis. In this paper, we 
are concerned with the second type of attack, and in particular in the context 
of elliptic curve cryptography. 

For elliptic curve cryptosystems, this type of attack applies to the scalar
multiplication. Following [5], a simple countermeasure to defeat simple
analysis attacks resides in replacing the standard double-and-add algorithm by
a double-and-add-always algorithm for computing $Q = kP$ on an elliptic curve
(see also [7] for further countermeasures dedicated to ABC curves). However,
such an algorithm is still susceptible to a differential analysis attack. Let
$k = (k_{m-1}, \ldots, k_0)_2$ be the binary expansion of multiplier k. Suppose
that an attacker already knows the highest bits, $k_{m-1}, \ldots, k_{j+1}$, of
k. Then, he guesses that the next bit $k_j$ is equal to one. He randomly
chooses several points $P_1, \ldots, P_t$ and computes
$Q_r = \bigl( \sum_{i=j}^{m-1} k_i 2^{i-j} \bigr) P_r$ for $1 \le r \le t$.
Using a boolean selection function g, he prepares two sets: the first set,
$S_{\mathrm{true}}$, contains the points $P_r$ such that
$g(Q_r) = \mathrm{true}$ and the second set, $S_{\mathrm{false}}$, contains
those such that $g(Q_r) = \mathrm{false}$. Depending on the side-channel
information monitored by the attacker and the actual implementation, a
selection function may, for example, be the value of a given bit in the
representation of $Q_r$.

Let $C(r)$ denote the side-channel information associated to the computation
of $kP_r$ by the cryptographic device (e.g., the power consumption). If the
guess $k_j = 1$ is incorrect then the difference

$$\langle C(r) \rangle_{1 \le r \le t,\; P_r \in S_{\mathrm{true}}} - \langle C(r) \rangle_{1 \le r \le t,\; P_r \in S_{\mathrm{false}}}$$

will be $\approx 0$ as the two sets appear as two random (i.e. uncorrelated)
sets; otherwise the guess is correct. Once $k_j$ is known, the remaining bits,
$k_{j-1}, \ldots, k_0$, are recovered recursively, in the same way. We note
that such attacks are not restricted to binary methods and can be adapted to
work with other scalar multiplication methods, as well.

To thwart differential attacks, it is recommended to randomize the basepoint 
P and the multiplier k in the computation of Q = kP. Several countermeasures 
are already known. See [5] for general curves and [7] for ABC curves. The next 
section proposes two techniques for randomizing the basepoint and Section 5 
shows how to randomize the multiplier for an ABC curve. 

4 Randomizing the Basepoint 

4.1 Elliptic Curve Isomorphisms 

We first recall some results on isomorphisms between elliptic curves. We say
that two elliptic curves over a field K defined by their Weierstraß equations E
and E' are isomorphic over K (or K-isomorphic) if they are isomorphic as
projective varieties. It turns out that curve isomorphisms induce group
morphisms. The determination of isomorphisms between two given elliptic curves
is solved in the next two corollaries.

Corollary 1. Let K be a field with $\mathrm{Char}\, K \neq 2, 3$. The elliptic
curves given by $E_{/K} : y^2 = x^3 + ax + b$ and
$E'_{/K} : y^2 = x^3 + a'x + b'$ are K-isomorphic if and only if there exists
$u \in K^*$ such that $u^4 a' = a$ and $u^6 b' = b$. Furthermore, we have

$$\varphi : E(K) \to E'(K), \quad \begin{cases} O \mapsto O \\ (x, y) \mapsto (u^{-2}x, u^{-3}y) \end{cases} \tag{2}$$

and

$$\varphi^{-1} : E'(K) \to E(K), \quad \begin{cases} O \mapsto O \\ (x, y) \mapsto (u^{2}x, u^{3}y). \end{cases} \tag{3}$$

Proof. With the notations of Proposition 2 (in appendix), we obtain
$r = s = t = 0$ and so $u^4 a'_4 = a_4$ and $u^6 a'_6 = a_6$
$\iff u^4 a' = a$ and $u^6 b' = b$ for some $u \in K^*$. □



Corollary 2. Let K be a field with $\mathrm{Char}\, K = 2$. The
(non-supersingular) elliptic curves given by
$E_{/K} : y^2 + xy = x^3 + ax^2 + b$ and
$E'_{/K} : y^2 + xy = x^3 + a'x^2 + b'$ are K-isomorphic if and only if there
exists $s \in K$ such that $a' = a + s + s^2$ and $b' = b$. Furthermore, we have

$$\varphi : E(K) \to E'(K), \quad \begin{cases} O \mapsto O \\ (x, y) \mapsto (x, y + sx) \end{cases} \tag{4}$$

and

$$\varphi^{-1} : E'(K) \to E(K), \quad \begin{cases} O \mapsto O \\ (x, y) \mapsto (x, y + sx). \end{cases} \tag{5}$$

Proof. From Proposition 2, the relation $ua'_1 = a_1 + 2s$ gives $u = 1$. The
third and fourth relations give $r = 0$ and $t = 0$, respectively. Hence, from
the second relation we have $a'_2 = a_2 - s - s^2$ whereas the last one yields
$a'_6 = a_6$ $\iff a' = a + s + s^2$ and $b' = b$. □



We can thus randomize the scalar multiplication algorithm as follows. We
perform the scalar multiplication on a random isomorphic elliptic curve and
then we come back to the original elliptic curve. More formally, if $\varphi$
is a random isomorphism from $E_{/K}$ to $E'_{/K}$, we propose to compute
$Q = kP$ in $E(K)$ according to

$$Q = \varphi^{-1}(k(\varphi(P))), \tag{6}$$

or schematically,

$$\begin{array}{ccc}
P \in E(K) & \longrightarrow & Q = kP \in E(K) \\
\downarrow \varphi & & \uparrow \varphi^{-1} \\
P' \in E'(K) & \longrightarrow & Q' = kP' \in E'(K)
\end{array}$$

Corollaries 1 and 2 indicate that computing the image of a point through 
an elliptic curve isomorphism can be done using only a few elementary field 
operations. This yields a very efficient means to randomize the computation of 
Q = kP. 
Algorithm 1 (Scalar Multiplication via Random Isomorphic Elliptic
Curves for $\mathrm{Char}\, K \neq 2, 3$).

Input: A point $P = (x_1, y_1) \in E(K)$ with $E_{/K} : y^2 = x^3 + ax + b$.
       An integer k.

Output: The point $Q = kP$.

1. Randomly choose an element $u \in K^*$;
2. Form the point $P' \leftarrow (u^{-2} x_1, u^{-3} y_1)$;
3. Evaluate $a' \leftarrow u^{-4} a$;
4. Compute $Q' \leftarrow kP'$ in $E'(K)$ with $E'_{/K} : y^2 = x^3 + a'x + b'$;¹
5. If ($Q' = O$) then return $Q = O$ and stop. Otherwise set $Q' \leftarrow (x', y')$;
6. Return $Q = (u^2 x', u^3 y')$.
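
A minimal sketch of Algorithm 1 over a prime field is given below. It uses
plain affine double-and-add for the scalar multiplication, which is not
itself a side-channel-resistant method; the helper names, the optional `u`
argument, and the toy curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$ are
assumptions made for illustration only.

```python
import secrets

def ec_add(P, Q, a, p):
    """Affine addition on y^2 = x^3 + a*x + b over F_p; None is the point O."""
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                            # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def scalar_mult(k, P, a, p):
    """Left-to-right double-and-add; the curve parameter b is never needed."""
    R = None
    for bit in bin(k)[2:]:
        R = ec_add(R, R, a, p)
        if bit == "1":
            R = ec_add(R, P, a, p)
    return R

def randomized_scalar_mult(k, P, a, p, u=None):
    """Algorithm 1: compute kP through a random isomorphic curve."""
    if u is None:
        u = secrets.randbelow(p - 1) + 1       # step 1: u in K*
    ui = pow(u, -1, p)
    x1, y1 = P
    P1 = (ui**2 * x1 % p, ui**3 * y1 % p)      # step 2
    a1 = ui**4 * a % p                         # step 3
    Q1 = scalar_mult(k, P1, a1, p)             # step 4
    if Q1 is None:                             # step 5
        return None
    xq, yq = Q1
    return (u**2 * xq % p, u**3 * yq % p)      # step 6
```

On the toy curve, `randomized_scalar_mult(k, (3, 6), 2, 97)` returns the same
point as `scalar_mult(k, (3, 6), 2, 97)` for every choice of u, while the
intermediate points seen during the computation differ from run to run.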

In [5, § 5.3], Coron suggests the randomization of projective coordinates in
order to blind the basepoint P: $P = (x_1, y_1)$ is randomized into
$(t^2 x_1 : t^3 y_1 : t)$ in Jacobian coordinates (or into
$(t x_1 : t y_1 : t)$ in homogeneous coordinates) for some $t \in K^*$. The
advantage of the proposed countermeasure is that, in Step 2 of Algorithm 1, we
can represent P' as the projective point
$P' = (u^{-2} x_1 : u^{-3} y_1 : 1)$, that is, a point with its Z-coordinate
equal to 1. This results in a faster scalar multiplication algorithm. Using the
values of Table 2, we precisely quantify the number of (field) multiplications
needed to compute $Q = kP$, considering in each case the faster coordinate
system. This is summarized in the next table.²



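Table 1. Average number of (field) multiplications to compute Q = kP.

                       | Random. proj. coord. ([5]), $a \neq -3$ | Random. proj. coord. ([5]), $a = -3$ | Algorithm 1
Double-and-add         | $17\tfrac{1}{2} \cdot |k|_2$ ($\mathcal{J}^m$) | $16 \cdot |k|_2$ ($\mathcal{J}$) | $15 \cdot |k|_2$ ($\mathcal{J}^m$)
Double-and-add-or-sub. | $15\tfrac{1}{3} \cdot |k|_2$ ($\mathcal{J}^m$) | $13\tfrac{1}{3} \cdot |k|_2$ ($\mathcal{J}$) | $12\tfrac{2}{3} \cdot |k|_2$ ($\mathcal{J}^m$)
Double-and-add-always  | $25 \cdot |k|_2$ ($\mathcal{J}^c$) | $23 \cdot |k|_2$ ($\mathcal{J}^c$) | $21 \cdot |k|_2$ ($\mathcal{J}$)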



For fields of characteristic 2, random isomorphisms of elliptic curves cannot
be considered alone as a means to protect against differential analysis. The
x-coordinate of basepoint P remains invariant through isomorphism $\varphi$ (cf. Eq. (4))
and so the resulting implementation may still be subject to a differential analysis
attack. However, it can be combined with other countermeasures to offer an
additional security level.

The next section presents a countermeasure that randomizes both the x- 
and the y-coordinates of point P, whatever the characteristic of the field we are 
working with. 



¹ Note that parameter b' is not required by the scalar multiplication algorithm.

² J, J^c and J^m respectively refer to the Jacobian coordinates, Chudnovsky Jacobian
coordinates and the modified Jacobian coordinates (see Appendix A.1).

4.2 Field Isomorphisms

Up to isomorphism, there is one and only one finite field L of characteristic p with
$p^n$ elements. Every such field may be viewed as the field generated (over $\mathbb{F}_p$) by a
root of an irreducible monic polynomial $\Pi$ of degree n. Given $\Pi(X)$, any element
of L can be represented as a polynomial in $\mathbb{F}_p[X]/(\Pi)$. If $e \in L$, we denote by $\vartheta_\Pi(e)$
its corresponding representation in $\mathbb{F}_p[X]/(\Pi)$. From another irreducible monic
polynomial $\Pi'(Y)$ of degree n, we obtain another representation for the elements
$e \in L$, $\vartheta_{\Pi'}(e) \in \mathbb{F}_p[Y]/(\Pi')$. The fields $K := \mathbb{F}_p[X]/(\Pi)$ and $K' := \mathbb{F}_p[Y]/(\Pi')$
being isomorphic, we let $\phi$ denote such an isomorphism from K to K'. The map
$\phi$ extends to K × K with $\phi(x, y) = (\phi(x), \phi(y))$. In particular, $\phi$ transforms the
equation of an elliptic curve over K into the equation of an elliptic curve over
K', i.e.,

$$E_{/K} : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6$$

is transformed into

$$E'_{/K'} : y^2 + \phi(a_1) xy + \phi(a_3) y = x^3 + \phi(a_2) x^2 + \phi(a_4) x + \phi(a_6).$$

Consequently, isomorphisms between fields can be used to randomize the
representation of the basepoint P. To compute Q = kP, we first choose randomly
a field K' isomorphic to K through isomorphism $\phi$. Then, we compute Q as

$$Q = \phi^{-1}\bigl(k(\phi(P))\bigr). \qquad (7)$$

In other words, we represent $P \in E(K)$ as a point $P' \in E'(K')$, next we compute
Q' := kP' in E'(K'), and finally we go back to the original representation by
representing Q' as a point $Q \in E(K)$.

At first glance, it is unclear that this could lead to an efficient countermeasure
in a constrained environment. Indeed, to build a field K' isomorphic to K, a
natural way consists in determining an irreducible monic polynomial of degree
n, $\Pi' \in \mathbb{F}_p[Y]$. An isomorphism $\phi$ is then obtained by computing a root $\alpha$ of $\Pi$
in K':

$$\phi : K \to K' : x \mapsto \sum_{i=0}^{n-1} x_i \alpha^i, \qquad (8)$$

where $x = \sum_{i=0}^{n-1} x_i X^i \in K = \mathbb{F}_p[X]/(\Pi)$. Likewise, the inverse map, from K'
to K, requires finding a root $\beta$ of $\Pi'$ in K.

However, we can do much better when some permanent writable memory is
at disposal (e.g., the EEPROM in a smart-card implementation). The general idea
is, given an isomorphism $\phi : K \to K'$ stored in EEPROM, to determine from $\phi$ and
K' a new field K'' and a new isomorphism $\phi' : K \to K''$, and so on. This can be
done thanks to Proposition 1, which yields a recursive method for constructing
irreducible polynomials of the same degree.

Proposition 1. Let T be a polynomial permutation³ of $\mathbb{F}_{p^n}$ and let $\Pi$ be an
irreducible polynomial in $\mathbb{F}_p[X]$ of degree n. Then polynomial $\Pi \circ T$ has at least
one irreducible factor of degree n, say $\Pi'$, in $\mathbb{F}_p[X]$.

³ A polynomial T with coefficients in $\mathbb{F}_p$ is a polynomial permutation of $\mathbb{F}_{p^n}$ if the map
$x \mapsto T(x)$ permutes $\mathbb{F}_{p^n}$.
Proof. Let $\alpha$ be a root of $\Pi$. As $\Pi$ is irreducible, the orbit of $\alpha$ under the
action of the Frobenius is of cardinality n. T being a permutation, the image
of this orbit through $T^{-1}$ is still of cardinality n. Since T is a polynomial with
coefficients in $\mathbb{F}_p$, it commutes with the Frobenius, and thus the image of the orbit
of $\alpha$ through $T^{-1}$ appears as the orbit of $T^{-1}(\alpha)$. Consequently, the polynomial
$\prod_{i=0}^{n-1} \bigl(X - (T^{-1}(\alpha))^{p^i}\bigr)$ is irreducible of degree n, and divides $\Pi \circ T$. □

Hence, if we choose a polynomial permutation T of small degree (e.g., 2 or 3),
we compute $\Pi \circ T$ and factor it with a specific algorithm to find $\Pi'$. A further
generalization consists in storing a family of polynomial permutations $\mathfrak{S} = \{T_i\}$
in EEPROM and randomly choosing $T \in \mathfrak{S}$ when constructing $\Pi'$.

We denote by $\Pi$ the publicly known polynomial which defines the field
$K = \mathbb{F}_p[X]/(\Pi)$ used as the reference field. We assume that another polynomial $\Pi^{(0)}$
defines the field $K^{(0)}$ isomorphic to K. We also assume that two polynomials
$\alpha^{(0)}, \beta^{(0)} \in \mathbb{F}_p[X]$ of degree at most n verifying Eqs. (9) have initially been
stored in EEPROM. These additional data must of course be kept secret. At the
j-th execution of the scalar multiplication algorithm, the EEPROM contains an
irreducible monic polynomial $\Pi^{(j)} \in \mathbb{F}_p[X]$ of degree n, and two polynomials
$\alpha^{(j)}, \beta^{(j)} \in \mathbb{F}_p[X]$ such that

$$\begin{cases} \Pi(\beta^{(j)}) \equiv 0 \pmod{\Pi^{(j)}} \\ \Pi^{(j)}(\alpha^{(j)}) \equiv 0 \pmod{\Pi} \end{cases} \qquad (9)$$

These relations simply say that $\alpha^{(j)}$ and $\beta^{(j)}$ respectively define an isomorphism
$\phi^{(j)}$ and its inverse from the field K to the field $\mathbb{F}_p[X]/(\Pi^{(j)})$ denoted by $K^{(j)}$.

We are now ready to give the algorithm. We choose randomly $T \in \mathfrak{S}$ and
determine an irreducible monic polynomial $\Pi^{(j+1)}$ of degree n in $\mathbb{F}_p[X]$ that
divides $\Pi^{(j)} \circ T$ with a method that will be explained later. Then we set
$\beta^{(j+1)} = \beta^{(j)} \circ T \bmod \Pi^{(j+1)}$, $\alpha^{(j+1)} = T^{-1}(\alpha^{(j)}) \bmod \Pi$, and we store
$\alpha^{(j+1)}$, $\beta^{(j+1)}$ and $\Pi^{(j+1)}$ in EEPROM. Here, $T^{-1}$ denotes the permutation inverse of T. It is easy
to check that $\alpha^{(j+1)}$, $\beta^{(j+1)}$ and $\Pi^{(j+1)}$ still verify Eqs. (9), and thus define
an isomorphism $\phi^{(j+1)}$ and its inverse from K to $\mathbb{F}_p[X]/(\Pi^{(j+1)})$. It remains to
compute $P' = \phi^{(j+1)}(P)$ and the coefficients of E'. Finally, we compute kP' in
E' and convert the resulting point by the inverse isomorphism to obtain Q = kP.
From the viewpoint of the running time, one of the advantages of this method
is that it skips the root finding step.

We still have to show how to solve Step 2 in the above algorithm. We illustrate
the technique in the case $K = \mathbb{F}_2[X]/(\Pi)$ with $\Pi$ of degree n and $\gcd(2^n - 1, 3) = 1$
(this case includes the popular choice n = 163 for elliptic curve cryptosystems),
but we stress that the proposed technique is fully general and can be adapted
to the other cases as well.

First, note that:

Lemma 1. If $\gcd(2^n - 1, 3) = 1$ then the elements of

$$\mathfrak{S} = \{X^3,\ 1 + X^3,\ X + X^2 + X^3,\ 1 + X + X^2 + X^3\} \subset \mathbb{F}_2[X]$$

permute $\mathbb{F}_{2^n}$.
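Algorithm 2 (Scalar Multiplication via Random Isomorphic Fields).
Input: A point $P = (x_1, y_1) \in E(K)$ with

$E_{/K} : y^2 + xy = x^3 + ax^2 + b$ if Char K = 2,

$E_{/K} : y^2 = x^3 + ax + b$ if Char K > 3.

An integer k.

[EEPROM: Polynomials $\alpha^{(j)}$, $\beta^{(j)}$ and $\Pi^{(j)}$.]

Output: The point Q = kP.

1. Randomly choose $T \in \mathfrak{S}$;

2. Determine, in $\mathbb{F}_p[X]$, an irreducible monic polynomial $\Pi^{(j+1)}$ s.t.
$\Pi^{(j+1)}$ divides $\Pi^{(j)} \circ T$;

3. Set $\beta^{(j+1)} \leftarrow \beta^{(j)} \circ T \bmod \Pi^{(j+1)}$;

4. Set $\alpha^{(j+1)} \leftarrow T^{-1}(\alpha^{(j)}) \bmod \Pi$;

5. Update the EEPROM with $\alpha^{(j+1)}$, $\beta^{(j+1)}$ and $\Pi^{(j+1)}$;

6. Form $P' \leftarrow P \circ \beta^{(j+1)} \bmod \Pi^{(j+1)}$;

7. Evaluate $a' \leftarrow a \circ \beta^{(j+1)} \bmod \Pi^{(j+1)}$;

8. Compute $Q' \leftarrow kP'$ in $E'(\mathbb{F}_p[X]/(\Pi^{(j+1)}))$;

9. If (Q' = O) then return Q = O and stop. Otherwise set $Q' \leftarrow (x'_3, y'_3)$;

10. Return $Q = (x'_3, y'_3) \circ \alpha^{(j+1)} \bmod \Pi$.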



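Proof. Let $\alpha$ be a primitive element of $\mathbb{F}_{2^n}^*$ (remember that $\mathbb{F}_{2^n} = \mathbb{F}_{2^n}^* \cup \{0\}$).
Then $\langle \alpha^3 \rangle$ generates a subgroup of order $(2^n - 1)/\gcd(2^n - 1, 3) = 2^n - 1$ and so
$\alpha^3$ is a primitive element or equivalently $X^3$ permutes $\mathbb{F}_{2^n}$. Suppose that there
exist $\alpha, \beta \in \mathbb{F}_{2^n}$ s.t. $\alpha^3 + 1 = \beta^3 + 1 \iff \alpha^3 = \beta^3$. This implies $\alpha = \beta$ since
$X^3$ is a permutation polynomial. The remaining cases are proved similarly by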
noting that — j3‘^ = {a — (3Y . □ 
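
Lemma 1 can be checked exhaustively on small fields. The sketch below is an illustration only: the representation of field elements as bit masks, the moduli X^5 + X^2 + 1 and X^4 + X + 1, and the function names are assumptions made here. It confirms that the four polynomials permute F_{2^5} (where gcd(2^5 − 1, 3) = 1) and that X^3 fails to permute F_{2^4} (where gcd(2^4 − 1, 3) = 3).

```python
def gf_mul(a, b, mod, n):
    """Multiply two elements of F_2[X]/(mod), deg(mod) = n, given as bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> n) & 1:      # reduce as soon as the degree reaches n
            a ^= mod
    return r

def is_permutation(coeffs, mod, n):
    """coeffs[i] is the F_2-coefficient of X^i in T; test that x -> T(x) is a bijection."""
    images = set()
    for x in range(1 << n):
        acc, power = 0, 1
        for c in coeffs:
            if c:
                acc ^= power
            power = gf_mul(power, x, mod, n)
        images.add(acc)
    return len(images) == (1 << n)

# The set of Lemma 1: X^3, 1 + X^3, X + X^2 + X^3, 1 + X + X^2 + X^3
LEMMA_SET = [[0, 0, 0, 1], [1, 0, 0, 1], [0, 1, 1, 1], [1, 1, 1, 1]]

MOD5 = 0b100101   # X^5 + X^2 + 1, irreducible over F_2
MOD4 = 0b10011    # X^4 + X + 1, irreducible over F_2
```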

Given a set $\mathfrak{S}$ of permutation polynomials, write $Q := \Pi \circ T$ for some $T \in \mathfrak{S}$
and $\Pi$ irreducible of degree n. The fact that Q has degree 3n enables us to
specialize the classical factorization algorithms (see, e.g., [3, p. 125]):

1. Compute $R = X^{2^n} - X \bmod Q$;

2. Then, using Proposition 1, $\Pi' = \gcd(Q, R)$ is irreducible of degree n in
$\mathbb{F}_2[X]$.
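
The two factorization steps above can be sketched with polynomials over F_2 stored as Python integers (bit i is the coefficient of X^i). This is a hedged illustration: the helper names and the toy parameters (n = 5, Π = X^5 + X^2 + 1) are assumptions made here, not values from the paper.

```python
def deg(f):
    return f.bit_length() - 1

def pmul(a, b):
    """Carry-less product of two polynomials over F_2."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def pmod(a, m):
    """Remainder of a modulo m over F_2."""
    dm = deg(m)
    while a and deg(a) >= dm:
        a ^= m << (deg(a) - dm)
    return a

def pgcd(a, b):
    while b:
        a, b = b, pmod(a, b)
    return a

def compose(f, t):
    """f(t(X)) over F_2, by Horner's rule."""
    r = 0
    for i in range(deg(f), -1, -1):
        r = pmul(r, t) ^ ((f >> i) & 1)
    return r

def new_modulus(pi, t, n):
    """Degree-n factor of pi o t, obtained as gcd(Q, X^(2^n) - X mod Q)."""
    q = compose(pi, t)
    r = 0b10                          # the polynomial X
    for _ in range(n):
        r = pmod(pmul(r, r), q)       # after n squarings: X^(2^n) mod Q
    r ^= 0b10                         # R = X^(2^n) - X mod Q
    return pgcd(q, r)
```

For instance, `new_modulus(0b100101, 0b1000, 5)` takes Π = X^5 + X^2 + 1 and T = X^3 and should return a degree-5 divisor of X^15 + X^6 + 1.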



5 Randomizing the Multiplier on ABC Curves

The other side of countermeasures for elliptic curve cryptography is the
introduction of a random value to blind the multiplier during the scalar multiplication. This
technique is useful to prevent Differential Analysis, but may also contribute to an
additional security level against Simple Analysis, as the multiplier is in general
secret.

The proposed method is specific to ABC curves (see Appendix A.2 for the
definitions) where the multiplier first goes through several encoding functions
before being used in the scalar multiplication loop itself. We take advantage of
the properties of this encoding to randomize the multiplier.







Building on previous works by Koblitz [8] and Meier-Staffelbach [11], Solinas
presents in [14] a very efficient algorithm to compute Q = kP on an ABC
curve. Letting $\tau : (x, y) \mapsto (x^2, y^2)$ denote the Frobenius endomorphism, his algorithm
proceeds as follows.

1. Compute, in $\mathbb{Z}[\tau]$, $\kappa \leftarrow k \bmod (\tau^n - 1)$;

2. Using [14, Algorithm 4], evaluate the $\tau$-NAF of $\kappa$, $\kappa = \sum_i \kappa_i \tau^i$;

3. Compute $Q \leftarrow kP$ as $Q \leftarrow \sum_i \kappa_i \tau^i(P)$;

4. Return Q.

Our randomization method exploits the structure of $\mathbb{Z}[\tau] \subset \mathrm{End}(E)$. Let
$\rho \in \mathbb{Z}[\tau]$. If $x \equiv y \pmod{\rho(\tau^n - 1)}$ then x and y still act identically on the
curve. Consequently, instead of reducing the multiplier modulo $\tau^n - 1$ (cf. Step 1
in the previous algorithm), we can reduce it modulo $\rho(\tau^n - 1)$ where $\rho$ is a
random element of $\mathbb{Z}[\tau]$. The length of the $\tau$-NAF produced is approximately
equal to $n + \log_2 N(\rho)$, which penalizes the scalar multiplication by $\log_2 N(\rho)$
additional steps. This makes it very easy to control the trade-off between the
running time and the expected security. Typically, for n = 163, we might impose
that $N(\rho) \approx 2^{40}$, which roughly produces a $\tau$-NAF of 200 digits (in {−1, 0, 1})
instead of 160 with the deterministic method. The detailed algorithm is presented
below.

Algorithm 3 (Scalar Multiplication via Random Exponent Recoding
for ABC Curves).

Input: A point $P = (x_1, y_1) \in E(\mathbb{F}_{2^n})$, an ABC curve.

An integer k.

A trade-off parameter l (typically, l = 40).

Output: The point Q = kP.

1. Randomly choose an element $\rho \in \mathbb{Z}[\tau]$ with $N(\rho) < 2^l$;

2. Compute, in $\mathbb{Z}[\tau]$, $k' \leftarrow k \bmod \rho(\tau^n - 1)$;

3. Evaluate the $\tau$-NAF of k', $k' = \sum_i k'_i \tau^i$;

4. Compute $Q \leftarrow \sum_i k'_i \tau^i(P)$;

5. Return Q.
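
Steps 1 to 3 of Algorithm 3 only involve arithmetic in Z[τ], so they can be sketched without any curve arithmetic. The code below is an illustration under stated assumptions: elements r + sτ are stored as pairs (r, s), μ = (−1)^{1−a} is fixed to 1 (that is, a = 1), the reduction uses plain coefficient-wise rounding rather than the rounding routine of [14] (so the remainder is congruent to k but not necessarily of minimal norm), and all function names are hypothetical.

```python
import secrets

MU = 1  # mu = (-1)^(1-a); a = 1 is assumed in this sketch

def zt_mul(x, y, mu=MU):
    """(a + b*tau)(c + d*tau), using tau^2 = mu*tau - 2."""
    (a, b), (c, d) = x, y
    return (a * c - 2 * b * d, a * d + b * c + mu * b * d)

def zt_norm(x, mu=MU):
    r, s = x
    return r * r + mu * r * s + 2 * s * s

def zt_conj(x, mu=MU):
    r, s = x
    return (r + mu * s, -s)           # conjugate of tau is mu - tau

def zt_mod(x, m, mu=MU):
    """A representative of x modulo m, rounding x/m coefficient-wise."""
    num, nm = zt_mul(x, zt_conj(m, mu), mu), zt_norm(m, mu)
    q = tuple((2 * c + nm) // (2 * nm) for c in num)   # nearest integers
    qm = zt_mul(q, m, mu)
    return (x[0] - qm[0], x[1] - qm[1])

def tau_naf(x, mu=MU):
    """tau-adic non-adjacent form, least significant digit first."""
    r, s = x
    digits = []
    while r != 0 or s != 0:
        if r % 2:
            u = 2 - ((r - 2 * s) % 4)   # u in {1, -1}
            r -= u
        else:
            u = 0
        digits.append(u)
        r, s = s + mu * (r // 2), -(r // 2)   # exact division by tau
    return digits

def zt_eval(digits, mu=MU):
    """Evaluate sum(digits[i] * tau^i) by Horner's rule."""
    acc = (0, 0)
    for d in reversed(digits):
        acc = zt_mul(acc, (0, 1), mu)
        acc = (acc[0] + d, acc[1])
    return acc

def randomized_recoding(k, n, l, mu=MU):
    bound = 1 << (l // 2)
    while True:                                        # step 1: 0 < N(rho) < 2^l
        rho = (secrets.randbelow(2 * bound) - bound,
               secrets.randbelow(2 * bound) - bound)
        if 0 < zt_norm(rho, mu) < (1 << l):
            break
    tn = (1, 0)
    for _ in range(n):
        tn = zt_mul(tn, (0, 1), mu)                    # tau^n
    modulus = zt_mul(rho, (tn[0] - 1, tn[1]), mu)      # rho * (tau^n - 1)
    k1 = zt_mod((k, 0), modulus, mu)                   # step 2
    return k1, modulus, tau_naf(k1, mu)                # step 3
```

Step 4 would then accumulate the signed digits with Frobenius applications on the curve, exactly as in the deterministic method.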

An interesting feature of this algorithm is that no additional routine needs 
to be implemented. It only requires a slight modification of the deterministic 
version. Furthermore, the random component p is spread over the full length 
of the multiplier. This may be better than simply adding to the multiplier a 
multiple of the order of the curve, as was suggested in [5, § 5.1].

6 Conclusion 

We proposed two new methods to blind the basepoint for an elliptic curve cryp- 
tosystem. These methods come from the idea of transposing the computation 
in another curve through a random morphism. In addition, we presented a new 
technique to randomize the encoding of the multiplier in the case of anomalous 
binary curves. 
A Mathematical Background

This appendix details the elliptic curve addition formulae. It also reviews some
well-known techniques for computing Q = kP in an elliptic curve E(K). An
excellent survey by Gordon, including the most recent developments, can be
found in [6].

In the sequel, we only consider the cases of a field K with Char K ≠ 2, 3 and
Char K = 2. In these cases, the general Weierstraß equation (cf. Eq. (1)) can be
simplified considerably through an appropriate admissible change of variables.
This is made explicit in the next proposition.

Proposition 2 ([12, Theorem 2.2]). The elliptic curves given by the Weierstraß
equations

$$E_{/K} : y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6 \quad\text{and}$$

$$E'_{/K} : y^2 + a'_1 xy + a'_3 y = x^3 + a'_2 x^2 + a'_4 x + a'_6$$

are isomorphic over K if and only if there exist $u \in K^*$ and $r, s, t \in K$ such that
the change of variables

$$(x, y) \leftarrow (u^2 x + r,\ u^3 y + u^2 s x + t)$$

transforms equation E into equation E'. Such a transformation is referred to as
an admissible change of variables. Furthermore,

$$\begin{cases}
u a'_1 = a_1 + 2s, \\
u^2 a'_2 = a_2 - s a_1 + 3r - s^2, \\
u^3 a'_3 = a_3 + r a_1 + 2t, \\
u^4 a'_4 = a_4 - s a_3 + 2r a_2 - (t + rs) a_1 + 3r^2 - 2st, \\
u^6 a'_6 = a_6 + r a_4 - t a_3 + r^2 a_2 - rt a_1 + r^3 - t^2.
\end{cases}$$



A.1 Elliptic Curves over a Field K with Char K ≠ 2, 3

When the characteristic of field K is different from 2, 3, the Weierstraß equation
of an elliptic curve can be simplified to:

$$E_{/K} : y^2 = x^3 + ax + b \quad (4a^3 + 27b^2 \neq 0). \qquad (10)$$

For any $P \in E(K)$, we have P + O = O + P = P. Let $P = (x_1, y_1)$ and
$Q = (x_2, y_2) \in E(K)$. The inverse of P is $-P = (x_1, -y_1)$. If Q = −P then
P + Q = O; otherwise the sum $P + Q = (x_3, y_3)$ is given by

$$x_3 = \lambda^2 - x_1 - x_2, \quad y_3 = \lambda(x_1 - x_3) - y_1 \qquad (11)$$

$$\text{with } \lambda = \begin{cases} \dfrac{y_2 - y_1}{x_2 - x_1} & \text{if } P \neq Q, \\[2mm] \dfrac{3x_1^2 + a}{2y_1} & \text{if } P = Q. \end{cases}$$

To avoid the division in the computation of $\lambda$, one usually works in projective
coordinates. There are basically two ways to project Eq. (10): (i) set x = X/Z
and y = Y/Z, that is, (X : Y : Z) are the homogeneous coordinates; or (ii) set
$x = X/Z^2$ and $y = Y/Z^3$, (X : Y : Z) are then referred to as the Jacobian
coordinates. Hence, to compute Q = kP on an elliptic curve, one first represents
point $P = (x_1, y_1)$ as $(X_1 : Y_1 : Z_1)$, computes $(X_2 : Y_2 : Z_2) = k(X_1 : Y_1 : Z_1)$,
and recovers $Q = (x_2, y_2)$ from its projective form.
Homogeneous coordinates. In homogeneous coordinates, (X : Y : Z) and
(tX : tY : tZ) (with $t \in K^*$) are two equivalent representations of the same point.
The point at infinity O is represented by (0 : 1 : 0); it is the only point with
its Z-coordinate equal to 0. Putting x = X/Z and y = Y/Z in Eq. (10), the
Weierstraß equation of an elliptic curve becomes

$$E_{/K} : Y^2 Z = X^3 + aXZ^2 + bZ^3. \qquad (12)$$

The formula to double a point $P = (X_1 : Y_1 : Z_1)$ is $2P = (X_3 : Y_3 : Z_3)$,
where

$$X_3 = SH, \quad Y_3 = W(T - H) - 2M^2 \quad\text{and}\quad Z_3 = S^3 \qquad (13)$$

with $W = 3X_1^2 + aZ_1^2$, $S = 2Y_1 Z_1$, $M = Y_1 S$, $T = 2X_1 M$ and $H = W^2 - 2T$. This
requires 12 multiplications. Notice that if a = −3 then $W = 3(X_1 - Z_1)(X_1 + Z_1)$;
in that case, the number of multiplications decreases to 10. The sum
$R = (X_3 : Y_3 : Z_3)$ of two points $P = (X_1 : Y_1 : Z_1)$ and $Q = (X_2 : Y_2 : Z_2)$ (with
P ≠ ±Q) is given by

$$X_3 = W X'_3, \quad 2Y_3 = RV - MW^3 \quad\text{and}\quad Z_3 = Z' W^3 \qquad (14)$$

with $U_1 = X_1 Z_2$, $U_2 = X_2 Z_1$, $S_1 = Y_1 Z_2$, $S_2 = Y_2 Z_1$, $T = U_1 + U_2$, $W = U_1 - U_2$,
$M = S_1 + S_2$, $R = S_1 - S_2$, $Z' = Z_1 Z_2$, $H = TW^2$, $X'_3 = -H + Z' R^2$ and
$V = H - 2X'_3$. The addition of two points can thus be done with only 14
multiplications. If one of the two points has its Z-coordinate equal to 1 then the
number of multiplications decreases to 11.



Jacobian coordinates. The use of Jacobian coordinates is suggested in the
P1363 IEEE Standard [1] because it allows faster arithmetic [2]. In Jacobian
coordinates, also, the representation of points is not unique: (X : Y : Z) and
$(t^2 X : t^3 Y : tZ)$ (with $t \in K^*$) are equivalent representations. The Weierstraß
equation is given by

$$E_{/K} : Y^2 = X^3 + aXZ^4 + bZ^6 \qquad (15)$$

and the point at infinity is represented by (1 : 1 : 0).

The double of point $P = (X_1 : Y_1 : Z_1)$ is equal to $2P = (X_3 : Y_3 : Z_3)$
where

$$X_3 = M^2 - 2S, \quad Y_3 = M(S - X_3) - T \quad\text{and}\quad Z_3 = 2Y_1 Z_1 \qquad (16)$$

with $M = 3X_1^2 + aZ_1^4$, $S = 4X_1 Y_1^2$ and $T = 8Y_1^4$. So doubling a point requires
10 multiplications. Here too, we see that the value a = −3 reduces the
number of multiplications; in this case, it decreases to 8.
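
As a sanity check of Eq. (16), the following sketch doubles a point given in randomized Jacobian coordinates $(t^2 x : t^3 y : t)$ and converts the result back to affine form. The toy field F_97, the curve y^2 = x^3 + 2x + 3, the point (3, 6) and the function names are assumptions chosen here for illustration; Python 3.8+ is assumed for `pow(Z, -1, p)`.

```python
P_MOD, A = 97, 2   # toy field and curve coefficient a (assumptions)

def randomize(x, y, t, p=P_MOD):
    """Jacobian representative (t^2 x : t^3 y : t) of the affine point (x, y)."""
    return (t * t * x % p, pow(t, 3, p) * y % p, t % p)

def jac_double(P, a=A, p=P_MOD):
    """Doubling in Jacobian coordinates, following Eq. (16)."""
    X1, Y1, Z1 = P
    M = (3 * X1 * X1 + a * pow(Z1, 4, p)) % p
    S = 4 * X1 * Y1 * Y1 % p
    T = 8 * pow(Y1, 4, p) % p
    X3 = (M * M - 2 * S) % p
    Y3 = (M * (S - X3) - T) % p
    Z3 = 2 * Y1 * Z1 % p
    return (X3, Y3, Z3)

def to_affine(P, p=P_MOD):
    """Recover (X/Z^2, Y/Z^3); assumes Z != 0."""
    X, Y, Z = P
    zi = pow(Z, -1, p)
    return (X * zi * zi % p, Y * pow(zi, 3, p) % p)
```

Whatever nonzero t is used, `to_affine(jac_double(randomize(3, 6, t)))` gives the same affine point, which is what makes the projective randomization transparent to the caller.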

The sum $R = (X_3 : Y_3 : Z_3)$ of points $P = (X_1 : Y_1 : Z_1)$ and $Q = (X_2 : Y_2 : Z_2)$
(with P ≠ ±Q) is given by

$$X_3 = R^2 - TW^2, \quad 2Y_3 = RV - MW^3 \quad\text{and}\quad Z_3 = Z_1 Z_2 W \qquad (17)$$
Table 2. Number of multiplications in addition formulae.

                               Addition               Doubling
                               Z2 ≠ 1     Z2 = 1      a ≠ −3     a = −3
Homogeneous coord.             14         11          12         10
Jacobian coord.                16         11          10         8
Chudnovsky Jacobian coord.     14         11          11         9
Modified Jacobian coord.       19         14          8          8

with $U_1 = X_1 Z_2^2$, $U_2 = X_2 Z_1^2$, $S_1 = Y_1 Z_2^3$, $S_2 = Y_2 Z_1^3$, $T = U_1 + U_2$, $W = U_1 - U_2$,
$M = S_1 + S_2$, $R = S_1 - S_2$ and $V = TW^2 - 2X_3$. An addition requires 16
multiplications. When one of the two points has its Z-coordinate equal to 1 then
an addition requires only 11 multiplications. A slightly different (but equally
efficient) formula for addition can be found in [13]. Using the same notations as
above, the sum $R = (X_3 : Y_3 : Z_3)$ is then given by

$$X_3 = R^2 - TW^2, \quad Y_3 = -RX_3 + (RU_1 - S_1 W)W^2 \quad\text{and}\quad Z_3 = Z_1 Z_2 W. \qquad (17')$$

Mixed coordinates. In the general case, we have seen that Jacobian coordinates
offer a faster doubling but a slower addition than homogeneous coordinates
(see Table 2). Chudnovsky and Chudnovsky [2] proposed to internally represent
a point (X : Y : Z) in Jacobian coordinates as a 5-tuple $(X, Y, Z, Z^2, Z^3)$. In
Chudnovsky Jacobian coordinates, the addition formula for $P = (X_1 : Y_1 : Z_1)$
and $Q = (X_2 : Y_2 : Z_2)$, respectively represented as $(X_1, Y_1, Z_1, Z_1^2, Z_1^3)$ and
$(X_2, Y_2, Z_2, Z_2^2, Z_2^3)$, remains the same as given by Eq. (17). The advantage is
that the values of $Z_1^2$, $Z_1^3$, $Z_2^2$ and $Z_2^3$ being available, they do not have to be
computed; only $Z_3^2$ and $Z_3^3$ have to be computed to represent the result
$R = P + Q = (X_3 : Y_3 : Z_3)$ as the 5-tuple $(X_3, Y_3, Z_3, Z_3^2, Z_3^3)$. Therefore,
Chudnovsky Jacobian coordinates require (4 − 2) = 2 multiplications fewer than
ordinary Jacobian coordinates to add two points. On the other hand, the
doubling is more expensive: it requires (2 − 1) = 1 multiplication more for computing
$R = 2P = (X_3 : Y_3 : Z_3)$ since $Z_1^2$ does not have to be computed (see Eq. (16)) but
$Z_3^2$ and $Z_3^3$ do.

The above strategy was optimized by Cohen, Miyaji and Ono [4] in order to
provide the fastest known doubling algorithm on a general elliptic curve. With
their coordinates, called modified Jacobian coordinates, a point (X : Y : Z) is
internally represented as a 4-tuple $(X, Y, Z, aZ^4)$. A point is doubled with only
8 multiplications whatever the value of parameter a. However, this fast doubling
is done at the expense of a slower addition: 19 multiplications are required to
add two points in the general case and 14 multiplications when one of the two
points has its Z-coordinate equal to 1.

A.2 Elliptic Curves over a Field K with Char K = 2

For fields of characteristic 2, the simplified Weierstraß equation depends on
whether the curve is supersingular or not. For cryptographic applications, we
are only interested in non-supersingular curves. In that case, it can be shown
that an admissible change of variables yields the simplified Weierstraß equation

$$E_{/K} : y^2 + xy = x^3 + ax^2 + b \quad (b \neq 0). \qquad (18)$$

O being the neutral element, we have P + O = O + P = P for any $P \in E(K)$.
Let $P = (x_1, y_1)$ and $Q = (x_2, y_2) \in E(K)$. The inverse of P is $-P = (x_1, x_1 + y_1)$.
If Q = −P then P + Q = O; otherwise the sum $P + Q = (x_3, y_3)$ is
calculated as follows.

- If P ≠ Q then

$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \quad y_3 = \lambda(x_1 + x_3) + x_3 + y_1 \qquad (19)$$

with $\lambda = \dfrac{y_1 + y_2}{x_1 + x_2}$.

- If P = Q then

$$x_3 = \lambda^2 + \lambda + a, \quad y_3 = x_1^2 + (\lambda + 1)x_3 \qquad (20)$$

with $\lambda = x_1 + \dfrac{y_1}{x_1}$.

An important subclass of elliptic curves has been introduced by Koblitz in [8]:
the Anomalous Binary Curves (or ABC curves in short), sometimes also referred
to as Koblitz curves. These are elliptic curves given by Eq. (18) with b = 1 and
$a \in \{0, 1\}$. For such curves, the Frobenius endomorphism, $\tau : (x, y) \mapsto (x^2, y^2)$,
satisfies the characteristic equation

$$u^2 - (-1)^{1-a} u + 2 = 0.$$

Koblitz suggests speeding up the computation of Q = kP by noticing that
$2P = (-1)^{1-a} \tau(P) - \tau^2(P)$. He also suggests writing k as a Frobenius expansion
since scalar multiplication by k is an endomorphism and $\mathbb{Z} \subset \mathbb{Z}[\tau] \subset \mathrm{End}(E)$.
The ring $\mathbb{Z}[\tau]$ is a Euclidean domain with respect to the norm
$N(r + s\tau) = r^2 + (-1)^{1-a} rs + 2s^2$. Furthermore, as $N(\tau) = 2$, every element $r + s\tau$ in $\mathbb{Z}[\tau]$
can be written as a $\tau$-adic non-adjacent form ($\tau$-NAF, in short), that is,

$$r + s\tau = \sum_i k_i \tau^i \quad\text{with } k_i \in \{-1, 0, 1\} \text{ and } k_i \cdot k_{i+1} = 0. \qquad (21)$$

As already remarked in [8], the drawback of this method is that the Frobenius
expansion (21) is roughly twice as long as the usual balanced binary expansion
and so, even if the evaluation of $\tau$ is very fast, it is not clear that the resulting
method is faster. The drawback was circumvented in [11,14] with the following
observation. We obviously have $\tau^n = 1$ and thus Q = k'P with $k' = k \bmod (\tau^n - 1)$.
As $N(\tau^n - 1) = \#E_a(\mathbb{F}_{2^n}) \approx 2^n$ by Hasse's Theorem, the $\tau$-NAF expression
of k', $k' = \sum_i k'_i \tau^i$, would have a length approximately equal to that of the
(usual) NAF expression of k. The non-adjacency property (i.e., $k'_i \cdot k'_{i+1} = 0$)
implies that, on average, only one third of the digits are nonzero [6]. Together
with the property that the evaluation of $\tau P$ is very fast, this yields a very efficient
algorithm for computing Q = kP.
Abstract. In this paper we show how using a representation of an elliptic
curve as the intersection of two quadrics in $\mathbb{P}^3$ can provide a defence
against Simple and Differential Power Analysis (SPA/DPA) style attacks.
We combine this with a 'random window' method of point multiplication
and point blinding. The proposed method offers considerable advantages
over standard algorithmic techniques of preventing SPA and DPA, which
usually require a significantly increased computational cost, usually more
than double. Our method requires roughly a seventy percent increase in
computational cost of the basic cryptographic operation, although we
give some indication as to how this can be reduced. In addition we show
that the Jacobi form is also more efficient than the standard Weierstrass
form for elliptic curves in the situation where SPA and DPA are not a
concern.



1 Introduction 

Elliptic curve based cryptosystems are particularly suited for cost-effective im- 
plementations of public key primitives on low powered computational devices 
such as Smart Cards, Mobile Phones and PDAs. Nevertheless, the use of side 
channel information, such as that provided by Simple and Differential Power 
Analysis (SPA/DPA) [7] on naive implementations can lead to the revelation of 
the secrets that the algorithm is working on. 

Elliptic curve systems have the advantage of almost always using a new 
random ephemeral secret integer in the double and add algorithm for each run 
of a protocol, unlike RSA. Hence, a DPA attack on ECC is harder to mount 
for this reason than one against RSA. On the other hand smart card vendors 
require any implementation to be as immune as possible from SPA and DPA. 

One problem with elliptic curve systems is that the doubling operation is 
significantly more efficient than the general addition operation. This needs to be 
compared to the RSA case, where squaring is only slightly more efficient than 
general multiplication. Hence, it may be possible to use SPA to recover some bits 
of each ephemeral exponent, since one may be able to distinguish an addition
from a doubling. Recall [6] that for EC-DSA only a few bits of each ephemeral 
exponent need to be leaked in this way per message, for the underlying secret 
key to be revealed. 

Hence, various proposals have been made to completely secure elliptic curve
systems against SPA and DPA. To protect against DPA it has been proposed
to use a randomised projective coordinate system. Here the base point on the
curve P = (x, y) on each protocol run is first randomised by replacing P with
the (Jacobian) projective point

$$P' = (xz^2, yz^3, z),$$

or the (homogeneous) projective point

$$P'' = (xz, yz, z),$$

for some random non-zero field element z. This still allows some of the efficient
techniques for point multiplication to be used, such as those described in [1]
and [5]. Mixed coordinate (i.e. affine and projective coordinates used
together) multiplication algorithms are, however, not used, which causes some
efficiency loss.

Moreover, the above defence will not protect against SPA, hence for SPA
protection one of two defences is usually proposed. The first is as follows:
instead of computing [k]P one computes [k + rq]P, where q is the order of P
and r is some random integer. This defence significantly increases the cost of a
point multiplication. It also does not provide any defence against SPA, since if one
can recover k' = k + rq from a single run then one can recover k = k' (mod q)
for this run since q is known. A second technique is to take a random integer
r and compute k' = rk (mod q) and r' = 1/r (mod q). One then computes
Q = [k']P and then [r']Q = [k]P, again a task which significantly increases
computational cost.
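
The scalar arithmetic behind these two blinding techniques can be sketched as follows. The function names and the toy group order are assumptions made for illustration, and only the exponent arithmetic is shown (no curve operations); q is assumed prime so that r is invertible modulo q.

```python
import secrets

def blind_additive(k, q, bits=32):
    """First defence: k' = k + r*q for a random r; [k']P = [k]P since [q]P = O."""
    r = secrets.randbits(bits)
    return k + r * q

def blind_multiplicative(k, q):
    """Second defence: k' = r*k mod q and r' = 1/r mod q, so that [r']([k']P) = [k]P."""
    r = secrets.randbelow(q - 1) + 1     # r in [1, q-1], invertible for prime q
    return (r * k) % q, pow(r, -1, q)
```

The first variant makes the weakness discussed above visible: anyone who learns k' from a single trace recovers the secret as `k_prime % q`.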

Neither of these defences against SPA address the underlying cause, which 
is the disparity between the addition and doubling algorithms. A model for the 
elliptic curve in which addition and doubling are given by the same formulae will 
not suffer from such side channel analysis on the code dependent nature of the 
operation. In this paper we propose such a model, based on the Jacobi form
of an elliptic curve. Our model, for certain elliptic curves, will provide a defence 
against SPA and will only give a 70 percent increase in computational cost. 

To understand our defence against SPA we first explain roughly how an SPA 
attack on a standard elliptic curve binary point multiplication method would 
proceed. Recall the binary method for point multiplication proceeds as in the 
following algorithm. 



Binary Multiplication Method

INPUT: A point P and an integer k.

OUTPUT: The point Q = [k]P.

1. Q ← O.

2. For i from t down to 0 do:

3.     Q ← [2]Q.

4.     If ($k_i$ = 1) then

5.         Q ← Q + P.

6. Return Q.
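
The binary method and the information an SPA attacker hopes to extract from it can be sketched as below. The sketch is generic over the group operations; the recorded operation trace is a hypothetical stand-in for what a power trace would reveal if doublings ("D") and additions ("A") were distinguishable, and the helper names are chosen here for illustration.

```python
def binary_method(k, P, add, double, zero):
    """Left-to-right double-and-add, returning the result and the operation trace."""
    Q, trace = zero, []
    for i in range(k.bit_length() - 1, -1, -1):
        Q = double(Q)
        trace.append("D")
        if (k >> i) & 1:           # the data-dependent branch
            Q = add(Q, P)
            trace.append("A")
    return Q, "".join(trace)

def bits_from_trace(trace):
    """What an attacker learns: each 'D' starts a bit, a following 'A' sets it to 1."""
    bits = []
    for op in trace:
        if op == "D":
            bits.append(0)
        else:
            bits[-1] = 1
    return int("".join(map(str, bits)), 2) if bits else 0
```

With the integers modulo 1009 under addition as a toy group, `binary_method(45, 7, lambda a, b: (a + b) % 1009, lambda a: 2 * a % 1009, 0)` produces the trace "DADDADADDA", from which `bits_from_trace` recovers 45.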



With a standard representation an attacker can attempt to determine the bits
of k by seeing how the program behaves at the if-statement. The test is always
carried out but the subroutine for point addition will only be called when the
ith bit of k is set. The attacker can attempt to spot this jump to a subroutine,
which will have a different power trace to point doubling, and hence determine
k.

The most common idea to make point addition and doubling indistinguish- 
able, is to unify the common code part for both operations, and add dummy code 
to balance the difference between point addition and point doubling. Ideally one 
needs to execute the same code at the same addresses but with different results, 
but this is unfortunately not possible if point addition and point doubling are
not unified. 

Now suppose exactly the same code was called for point addition and point 
doubling with the same power trace profile for both operations. The attacker 
would now need to determine whether one or two calls to this procedure were 
performed on each iteration. This is a much harder problem for SPA to solve, 
but if this is still a worry one can unroll the loop to make this task harder for 
the attacker. But for standard elliptic curve Weierstrass models one cannot use 
the point addition code in the case where the two points are equal, since the 
addition formulae contain a singularity when the inputs are the same. 

Notice that the defence of simply adding spurious multiplication operations
into the doubling code, as mentioned above, would not be a suitable defence since
the point doubling and point addition code would still have separate execution
profiles, and would reside in different areas of memory or hardware.

Nevertheless, with the basic double and add algorithm a little bit of information
can leak from the bit test, even if the same code is used for point addition
and doubling. A careful implementation can make this information unusable
in practice from the point of view of an attacker. Moreover, we present in the
last section a multiplication algorithm that significantly reduces the amount of
information that can leak from point multiplication.

One is still left open to a DPA style attack whereby internal data bits are 
guessed (depending on whether the if statement produces a branch) and these 
are correlated over a number of runs. However, for ECC systems these are easily 
prevented by point blinding (essentially using the redundancy of a projective 
coordinate representation) or by the protocol using ephemeral point multiples 
on each run. 
2 Intersection of Two Quadrics

Let K denote our ground field, which in applications will be a finite field $\mathbb{F}_p$ of
characteristic greater than three. It is well known that an intersection of two
quadric surfaces in $\mathbb{P}^3$

$$Q : \{Q_1(x_0, x_1, x_2, x_3) = 0\} \cap \{Q_2(x_0, x_1, x_2, x_3) = 0\}$$

generically defines a curve of genus one. Hence, assuming Q has a point defined
over K, the curve Q is birationally equivalent to an elliptic curve, also defined
over K.

Just as the chord-tangent law defines a geometric group law on the elliptic
curve we can also define a group law on Q in geometric terms, see [8]. We first let
$P_0$ denote our given K-rational point on Q, which we shall treat as the identity.
Three points $P_1, P_2, P_3 \in Q(K)$ will sum to zero if and only if the four points
$P_0, P_1, P_2$ and $P_3$ are coplanar. The negation of a point $-P_1$ is given as the
residual intersection of the plane through $P_1$ containing the tangent line to Q at
$P_0$.

An algorithm to pass from a general intersection of two quadric surfaces with
a K-rational point to an elliptic curve is given in [2, p 36]. In [3, pp 63-64] a
method is given to pass in the other direction, from a general elliptic curve over
K



$$E : Y^2 = X^3 + AX + B,$$

to the intersection of two quadrics given by

$$Q : \begin{cases} z_1^2 - B z_3^2 - A z_2 z_3 - z_0 z_2 = 0, \\ z_2^2 - z_0 z_3 = 0. \end{cases}$$

The map from a point $(X, Y) \in E(K)$ to a point $(z_0, z_1, z_2, z_3) \in Q(K)$ is given
by $z_0 = X^2$, $z_1 = Y$, $z_2 = X$ and $z_3 = 1$.
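
The map from E to Q is easy to check numerically. In the sketch below, the toy field F_97 and the curve Y^2 = X^3 + 2X + 3 are assumptions chosen for illustration, and the function names are hypothetical.

```python
def to_quadric(X, Y, p):
    """Image (z0, z1, z2, z3) = (X^2, Y, X, 1) of an affine point of E."""
    return (X * X % p, Y % p, X % p, 1)

def on_quadric(z, A, B, p):
    """Check both defining equations of Q for the curve Y^2 = X^3 + A*X + B."""
    z0, z1, z2, z3 = z
    q1 = (z1 * z1 - B * z3 * z3 - A * z2 * z3 - z0 * z2) % p
    q2 = (z2 * z2 - z0 * z3) % p
    return q1 == 0 and q2 == 0
```

Substituting the image point gives q1 = Y^2 − B − AX − X^3 and q2 = X^2 − X^2, so q1 vanishes exactly when (X, Y) lies on E.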

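Also in [3] formulae are given to add points on Q(K). If we let $\mathbf{a} = (a_0, a_1, a_2, a_3)$
and $\mathbf{b} = (b_0, b_1, b_2, b_3)$ denote two points on Q(K) then their sum is given by
$\mathbf{c} = \mathbf{a} + \mathbf{b}$ with

$$\begin{aligned}
c_0 &= R(\mathbf{a}, \mathbf{b})^2, \\
c_1 &= b_1 S(\mathbf{a}, \mathbf{b}) + a_1 S(\mathbf{b}, \mathbf{a}), \\
c_2 &= R(\mathbf{a}, \mathbf{b}) \cdot T(\mathbf{a}, \mathbf{b}), \\
c_3 &= T(\mathbf{a}, \mathbf{b})^2
\end{aligned}$$

where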

_R(a., b) = — 2y4.(Z2&2 — 

S'(a, b) = Oq^o -I- 2Ao20o62 -I- 4H02O063 + 3A03O060 

-|- 12 i 3 o 30 o 62 ~ 3^^030963 -l- 4H03O2&0 ~ 2^^030262 
— 4AH0302&3 — J ^ a \ b 3 — 85^0363, 

T(a, b) = 2 oi 6 i -I- 0260 T 0,0^2 T A0362 -I- 2H0363. 
What is remarkable about these equations is that they also hold when a = b, i.e.
when a doubling operation is performed. Hence, the use of such a representation 
will remove the distinction between doubling and adding, and hence help to 
defeat SPA as argued above. However, the above formulae are overly complicated 
and therefore not particularly suited to a real life implementation, so in the next 
section we reduce to a special class of elliptic curves over K for which the above 
formulae can be made particularly simple leading to efficient implementation. 



3 Jacobi Form 

To make the formulae from the above section more amenable to machine
calculation we require that our quadrics $Q$ be simultaneously diagonalisable
over $K$. This is equivalent to saying that our initial elliptic curve has three
points of order two defined over $K$, or equivalently that the polynomial
$X^3 + AX + B$ has all three roots defined over $K$.

Hence, from now on we shall assume we have chosen an elliptic curve

$$E : Y^2 = X^3 + AX + B$$

which has three points of order two defined over $K$. This means that the group
order $N = \#E(\mathbb{F}_p)$ is divisible by 4, hence we should choose such a
curve with $N = 4q$ with $q$ a prime.

By applying a standard Möbius transformation we can move the three points
of order two to the positions $(0, 0)$, $(-1, 0)$ and $(-\lambda, 0)$ where
$\lambda \in K$. Our elliptic curve has then become

$$E' : y^2 = x(x + 1)(x + \lambda).$$

To obtain this transformation, first write the factorisation of $X^3 + AX + B$
over $K$ as

$$X^3 + AX + B = (X - \theta_1)(X - \theta_2)(X - \theta_3).$$

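Then we define the following Möbius transformation, where
$\{i, j, k\} = \{1, 2, 3\}$,

$$x = \frac{X - \theta_i Z}{\theta_i - \theta_j}, \qquad z = Z, \qquad y = \frac{Y}{(\theta_i - \theta_j)^{3/2}},$$

where $(X, Y, Z)$ is a homogeneous projective point on $E(K)$ and $(x, y, z)$ is a
homogeneous projective point on $E'(K)$. Then setting

$$\lambda = \frac{\theta_i - \theta_k}{\theta_i - \theta_j}$$

we see that the curve $E$ is mapped to the curve $E'$ since

$$x(x + z)(x + z\lambda) = \frac{(X - \theta_1 Z)(X - \theta_2 Z)(X - \theta_3 Z)}{(\theta_i - \theta_j)^3} = \frac{Y^2 Z}{(\theta_i - \theta_j)^3} = y^2 z.$$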

This change of variable requires that for some $1 \le i, j \le 3$ with $i \ne j$
we have that $\theta_i - \theta_j$ is a square modulo $p$. If $p \equiv 3 \pmod 4$
then $-1$ will not be a square modulo $p$ and so either

$$\theta_i - \theta_j \quad\text{or}\quad \theta_j - \theta_i$$

will be a square modulo $p$, for all possible $i$ and $j$. When $p \equiv 1 \pmod 4$
then there is a $1/8$ chance for given $\theta_1, \theta_2, \theta_3$ that we cannot
find a pair of indices such that $\theta_i - \theta_j$ is a square modulo $p$.

In [4] Chudnovsky and Chudnovsky consider the following intersection of two
quadrics

$$Q : \begin{cases} x_0^2 + x_1^2 - x_3^2 = 0, \\ k^2 x_0^2 + x_2^2 - x_3^2 = 0. \end{cases}$$

From two points $(a_0, a_1, a_2, a_3)$ and $(b_0, b_1, b_2, b_3)$ on $Q$ we can
compute their sum $(c_0, c_1, c_2, c_3)$ via the formulae

$$c_0 = a_3 b_1 \cdot a_0 b_2 + a_2 b_0 \cdot a_1 b_3,$$

$$c_1 = a_3 b_1 \cdot a_1 b_3 - a_2 b_0 \cdot a_0 b_2,$$

$$c_2 = a_3 a_2 b_3 b_2 - k^2 a_0 a_1 b_0 b_1,$$

$$c_3 = (a_3 b_1)^2 + (a_2 b_0)^2.$$

The zero of this group law is given by the point $(0, 1, 1, 1)$. The above formulae
for the group law on $Q$ are also valid when
$(a_0, a_1, a_2, a_3) = (b_0, b_1, b_2, b_3)$, and so the same formulae can be
used both for doubling and general addition. Each addition or doubling can be
efficiently implemented so that it requires a total of 16 field multiplications.

For use in signed window methods of point multiplication we require the
formulae for point negation in the Jacobi model. Given the addition formulae
above it is easy to see that

$$-(a_0, a_1, a_2, a_3) = (-a_0, a_1, a_2, a_3).$$

We now, for a moment, leave our main application of defences against SPA 
and DPA and turn to the use of Jacobi form as a way of speeding up algorithms 
for elliptic curve point multiplication in environments where SPA and DPA are 
not a concern. 

By using the doubling formulae given in [4]

$$c_0 = 2 a_1 a_3 \cdot a_2 a_0,$$

$$c_1 = (a_1 a_3)^2 - (a_2 a_3)^2 + (a_1 a_2)^2,$$

$$c_2 = (a_2 a_3)^2 - (a_1 a_3)^2 + (a_1 a_2)^2,$$

$$c_3 = (a_2 a_3)^2 + (a_1 a_3)^2 - (a_1 a_2)^2,$$

where $(c_0, c_1, c_2, c_3) = [2](a_0, a_1, a_2, a_3)$, we obtain doubling formulae
which require only eight field multiplications.

However, with a little care one can even achieve doubling in seven field mul- 
tiplications, which is more efficient than doubling in projective coordinates on a 
standard Weierstrass equation in odd characteristic. 

Lemma 1. A point can be doubled in the Jacobi model using seven field
multiplications.

Proof. We first take the doubling formulae obtained from specialising the general
point addition method to obtain

$$c_0 = 2 a_3 a_1 \cdot a_2 a_0,$$

$$c_1 = (a_3 a_1)^2 - (a_2 a_0)^2,$$

$$c_2 = (a_3 a_2)^2 - k^2 (a_0 a_1)^2,$$

$$c_3 = (a_3 a_1)^2 + (a_2 a_0)^2,$$

which requires ten field multiplications to evaluate. Using the equations of the
curve,

$$k^2 a_0^2 = a_3^2 - a_2^2 \quad\text{and}\quad a_0^2 = a_3^2 - a_1^2,$$

we see that we can, assuming $a_2 \ne 0$, rewrite $c_2$ as

$$c_2 = (a_0 a_2)^2 - (a_1 a_3)^2 + 2 (a_1 a_2)^2.$$

Then we can perform a doubling by evaluating

$$\ell_1 = a_3 a_1,$$

$$\ell_2 = a_0 a_2,$$

$$\ell_3 = 2 (a_1 a_2)^2,$$

$$c_0 = 2 \ell_1 \ell_2,$$

$$c_3 = (\ell_1 + \ell_2)^2 - c_0,$$

$$c_1 = c_3 - 2 \ell_2^2,$$

$$c_2 = -c_1 + \ell_3.$$

It is easy to verify that the same equations hold when $a_2 = 0$.
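
The unified addition law and the seven-multiplication doubling of Lemma 1 can be checked numerically. The following Python sketch is an illustration only: the toy prime `P_MOD`, the value `K2` (standing for $k^2$) and all function names are assumptions made for this example and do not come from the paper.

```python
# Toy parameters (assumptions): a small prime p = 3 (mod 4) and an arbitrary
# k^2 different from 0 and 1. Real systems use primes of 160 bits or more.
P_MOD = 10007
K2 = 5


def sqrt_mod(x, p=P_MOD):
    """Square root modulo a prime p = 3 (mod 4); None if x is a non-residue."""
    r = pow(x, (p + 1) // 4, p)
    return r if r * r % p == x % p else None


def find_point(p=P_MOD, k2=K2):
    """Search for (s, c, d, 1) with s^2 + c^2 = 1 and k^2 s^2 + d^2 = 1."""
    for s in range(2, p):
        c = sqrt_mod((1 - s * s) % p, p)
        d = sqrt_mod((1 - k2 * s * s) % p, p)
        if c and d:  # both roots exist and are non-zero
            return (s, c, d, 1)
    raise ValueError("no affine point found")


def add(a, b, p=P_MOD, k2=K2):
    """Unified addition on Q; the same formulae serve for doubling."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    c0 = a3 * b1 * a0 * b2 + a2 * b0 * a1 * b3
    c1 = a3 * b1 * a1 * b3 - a2 * b0 * a0 * b2
    c2 = a3 * a2 * b3 * b2 - k2 * a0 * a1 * b0 * b1
    c3 = (a3 * b1) ** 2 + (a2 * b0) ** 2
    return (c0 % p, c1 % p, c2 % p, c3 % p)


def double7(a, p=P_MOD):
    """Doubling with seven field multiplications (proof of Lemma 1)."""
    a0, a1, a2, a3 = a
    l1 = a3 * a1 % p                 # multiplication 1
    l2 = a0 * a2 % p                 # multiplication 2
    t = a1 * a2 % p                  # multiplication 3
    l3 = 2 * t * t % p               # multiplication 4
    c0 = 2 * l1 * l2 % p             # multiplication 5
    c3 = ((l1 + l2) ** 2 - c0) % p   # multiplication 6
    c1 = (c3 - 2 * l2 * l2) % p      # multiplication 7
    c2 = (l3 - c1) % p
    return (c0, c1, c2, c3)


def negate(a, p=P_MOD):
    """Point negation: only the first coordinate changes sign."""
    a0, a1, a2, a3 = a
    return ((-a0) % p, a1, a2, a3)


def proj_eq(u, v, p=P_MOD):
    """Projective equality: both non-zero and all 2x2 minors vanish."""
    return any(u) and any(v) and all(
        (u[i] * v[j] - u[j] * v[i]) % p == 0
        for i in range(4) for j in range(4))
```

For a point `a = find_point()`, `add(a, a)` and `double7(a)` agree coordinate by coordinate, and adding `(0, 1, 1, 1)` leaves `a` unchanged as a projective point.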

It is interesting to note that this means we can triple a point in 16 + 7 = 23 field
multiplications. Note, in [4] triplication formulae for points in the Jacobi model
are also given, which also require only 23 field multiplications.

To use these formulae all that remains is to produce the link between $Q$ and
$E'$. The two parameters $k$ and $\lambda$ defining $Q$ and $E'$ are linked by the
equation

$$\lambda = 1 - k^2.$$

To describe the map from $E'$ to $Q$, let $(x, y, z)$ denote a projective point on
$E'$, i.e.

$$y^2 z = x(x + z)(x + z\lambda);$$

such a point is obtained from $(X, Y)$ by generating a random $z \in K^*$ and
putting $(x, y, z) = (Xz, Yz, z)$. Note that this homogeneous projective
representation, as remarked on both above and below, is needed to prevent DPA
attacks. The equivalent point on $Q$ is then given by the equations

$$x_0 = -2(x + z) y,$$

$$x_1 = -z^2 + z^2 k^2 + 2 z k^2 x + k^2 x^2 + y^2 - x^2 - 2 z x = \lambda(-x^2 - z^2 - 2xz) + y^2,$$

$$x_2 = -2 z k^2 x - k^2 x^2 - z^2 k^2 + z^2 + 2 z x + y^2 + x^2 = \lambda(x^2 + z^2 + 2xz) + y^2,$$

$$x_3 = -z^2 k^2 + k^2 x^2 + z^2 + 2 z x + y^2 + x^2 = \lambda z^2 + y^2 + 2xz + (2 - \lambda) x^2.$$

The reverse operation is obtained by computing

$$x = (x_2 - x_3)\lambda,$$

$$y = x_0 \lambda k^2,$$

$$z = x_1 k^2 - x_2 + x_3 \lambda.$$
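
The map from $E'$ to $Q$ and its inverse can be exercised on a toy example. The sketch below is only an illustration: the prime `P_MOD`, the value `LAM` (standing for $\lambda$), the fixed blinding value used in the usage note, and the function names are all assumptions, not values from the paper. In a real implementation the blinding factor $z$ would be drawn at random on every run.

```python
P_MOD = 10007             # toy prime with p = 3 (mod 4); an assumption
LAM = 3                   # lambda of E': y^2 = x (x + 1) (x + lambda)
K2 = (1 - LAM) % P_MOD    # k^2, from lambda = 1 - k^2


def affine_point(p=P_MOD, lam=LAM):
    """Search for an affine point (X, Y) on E' with Y != 0."""
    for X in range(1, p):
        rhs = X * (X + 1) * (X + lam) % p
        Y = pow(rhs, (p + 1) // 4, p)   # candidate square root, p = 3 (mod 4)
        if Y and Y * Y % p == rhs:
            return X, Y
    raise ValueError("no affine point found")


def to_jacobi(pt, p=P_MOD, lam=LAM):
    """Map a projective point (x, y, z) on E' to (x0, x1, x2, x3) on Q."""
    x, y, z = pt
    w = (x + z) ** 2                    # x^2 + z^2 + 2xz
    x0 = -2 * (x + z) * y
    x1 = y * y - lam * w
    x2 = y * y + lam * w
    x3 = lam * z * z + y * y + 2 * x * z + (2 - lam) * x * x
    return tuple(v % p for v in (x0, x1, x2, x3))


def from_jacobi(q, p=P_MOD, lam=LAM, k2=K2):
    """Reverse map from Q back to a projective point on E'."""
    x0, x1, x2, x3 = q
    x = (x2 - x3) * lam
    y = x0 * lam * k2
    z = x1 * k2 - x2 + x3 * lam
    return tuple(v % p for v in (x, y, z))
```

With `X, Y = affine_point()` and a blinding value such as `z = 1234`, the image `to_jacobi((X * z % P_MOD, Y * z % P_MOD, z))` satisfies both quadric equations, and `from_jacobi` returns a scalar multiple of the original projective point.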



Suppose we implemented a standard point multiplication algorithm using a 
signed window method with r = 5, see [1, Algorithm IV. 7], on the elliptic curve 
E over Fp, where p is a 192-bit prime number. This would, on average, require 
191 point doublings and 38 general point additions. The standard projective 
coordinate methods on the curve E require 16 field multiplications to perform 
a general addition and 8 field multiplications to perform a doubling. Hence, the 
average number of field operations required would be 2136. 

Using our Jacobi representation and the same multiplication algorithm we
would require on average 3664 field multiplications since both doubling and
general addition require 16 field operations. Hence, we obtain about 70 percent
performance penalty as compared to the standard method. However, since
doubling and addition are performed by the same code we hopefully obtain a better
defence against SPA attacks.

If we were not concerned with a defence against SPA/DPA then using the 
Jacobi model we can perform a point multiplication in, on average, 1945 field 
multiplications. This is because we can perform a double in seven field multi- 
plications. Therefore, the Jacobi model gives roughly a ten percent performance 
improvement over the standard Weierstrass model. 
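
The three operation counts quoted so far follow directly from the figures of 191 doublings and 38 general additions. A minimal Python sketch of the arithmetic (the function name is an assumption made for this example):

```python
def field_mults(doublings, additions, dbl_cost, add_cost):
    """Total field multiplications for one point multiplication."""
    return doublings * dbl_cost + additions * add_cost


# 192-bit scalar, signed window method: 191 doublings, 38 additions.
weierstrass = field_mults(191, 38, dbl_cost=8, add_cost=16)
jacobi_unified = field_mults(191, 38, dbl_cost=16, add_cost=16)
jacobi_fast = field_mults(191, 38, dbl_cost=7, add_cost=16)

penalty = jacobi_unified / weierstrass - 1   # about 0.72
saving = 1 - jacobi_fast / weierstrass       # about 0.09
```

The ratios correspond to the "about 70 percent" penalty and the "roughly ten percent" improvement stated in the text.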

Returning to our main interest of defending against SPA/DPA we can obtain 
a better performance in the following way. We can flip a coin before doubling to 
decide whether we use the 7 or the 16 field operations formulae for doubling. The 
average number of field multiplications then becomes 2040, which is more effi- 
cient than the standard algorithm using a Weierstrass model. Hence, we obtain 
greater efficiency and a defence against SPA/DPA at the same time. 
Chudnovsky and Chudnovsky [4] give a number of possible other improvements
to multiplication algorithms in Jacobi models. However, they address this
problem from the point of view of efficiency and not from the point of view
of minimising the effect of DPA. We leave it as an open research problem to
reconcile these two approaches for elliptic curves in Jacobi form.

To protect even further against DPA-type attacks we stress that we need to
perform a method of point blinding whilst transforming from the standard form
to the Jacobi form, as above. Assume the affine point $P = (X, Y) \in E(K)$ is
given; on every protocol run one then randomises the representation of $P$ by
taking a homogeneous representation. This is achieved by generating a random
element $Z' \in K^*$ and replacing $P$ by the equivalent point $P' = (X', Y', Z')$
where $X' = XZ'$ and $Y' = YZ'$.



3.1 Example Curve 

The prime field $\mathbb{F}_p$ defined by

$$p = 2^{192} - 2^{64} - 1$$

is a popular choice for elliptic curve systems, since it offers a number of efficiency
advantages. For this field one could choose the curve defined by

$$\lambda = 421$$

which has group order

6277101735386680763835789423320997497001573836313910896964

which is four times a 190 bit prime.



4 Randomised Signed Windows Method 

To add even further defence against side channel analysis we propose the use of 
a signed window multiplication algorithm, which uses a random window width. 
This defence can also be used for standard elliptic curve systems, and not just 
those in the Jacobi model considered above. 

We keep the main signed window algorithm as standard, see for example [1,
Algorithm IV.7]. However, we alter the preprocessing of the 'exponent', as in
[1, Algorithm IV.6], so as to produce a random window width. We assume that
the system will multiply a fixed point $P$ by a random number $k$, using a lookup
table of the point multiples

$$P_i = [2i + 1]P,$$

for $0 \le i \le 2^{R-2} - 1$. The preprocessing in the signed window algorithm is
used to express $k$ as

$$k = \sum_{i=0}^{d-1} b_i 2^{e_i}$$

where $e_i \in \mathbb{Z}_{\ge 0}$ and

$$b_i \in \{-2^{R-1} + 1, -2^{R-1} + 3, \ldots, 2^{R-1} - 3, 2^{R-1} - 1\}.$$

Usually one uses fixed window lengths so that $e_{i+1} - e_i \ge R$ for
$0 \le i \le d - 2$. The following algorithm produces a randomised signed window
representation of $k$ which will provide a more difficult target for side channel
analysis.



Signed m-ary Window Decomposition

INPUT: An integer $k = \sum_{j=0}^{\ell} k_j 2^j$, $k_j \in \{0, 1\}$, $k_\ell = 0$.

OUTPUT: A sequence of pairs $\{(b_i, e_i)\}_{i=0}^{d-1}$.

1. $d \leftarrow 0$, $j \leftarrow 0$.
2. While $j \le \ell$ do:
3.     If $k_j = 0$ then $j \leftarrow j + 1$.
4.     Else do:
5.         $r \leftarrow_R \{1, \ldots, R\}$.
6.         $t \leftarrow \min\{\ell, j + r - 1\}$, $h_d \leftarrow (k_t k_{t-1} \cdots k_j)_2$.
7.         If $h_d > 2^{r-1}$ then do:
8.             $b_d \leftarrow h_d - 2^r$,
9.             increment the number $(k_\ell k_{\ell-1} \cdots k_{t+1})_2$ by 1.
10.        Else $b_d \leftarrow h_d$.
11.        $e_d \leftarrow j$, $d \leftarrow d + 1$, $j \leftarrow t + 1$.
12. Return the sequence $(b_0, e_0), (b_1, e_1), \ldots, (b_{d-1}, e_{d-1})$.

The only change from the standard algorithm is the addition of line 5, where
$\leftarrow_R$ denotes a random assignment to the variable on the left from the
set on the right.
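
The decomposition can be sketched in Python as follows. This is an illustrative reading of the algorithm above, not the authors' code: the function name and the `rng` parameter are assumptions, and Python's `random` module is used only for the demonstration. A real implementation would need a cryptographically secure source of randomness.

```python
import random


def randomised_signed_windows(k, R, rng=random):
    """Return pairs (b_i, e_i) with k == sum(b_i * 2**e_i), every b_i odd.

    k   : non-negative integer to decompose
    R   : maximum window width; each window width r is drawn from {1..R}
    rng : object offering randint(a, b) (hypothetical hook for testing)
    """
    if k < 0 or R < 1:
        raise ValueError("k must be non-negative and R at least 1")
    ell = k.bit_length()          # top index: bit ell of k is zero
    pairs = []
    j = 0
    while j <= ell:
        if (k >> j) & 1 == 0:
            j += 1
            continue
        r = rng.randint(1, R)                      # line 5: random width
        t = min(ell, j + r - 1)
        h = (k >> j) & ((1 << (t - j + 1)) - 1)    # bits k_t ... k_j
        if h > (1 << (r - 1)):
            b = h - (1 << r)                       # negative odd digit
            k += 1 << (t + 1)                      # carry into upper bits
        else:
            b = h
        pairs.append((b, j))
        j = t + 1
    return pairs
```

Calling the function twice on the same `k` will generally give different sequences of pairs, which is the intended effect: the pattern of additions and doublings changes from run to run while the represented scalar stays the same.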

5 Conclusion 

In this paper we have proposed two new defences against side channel analysis 
for elliptic curve based cryptosystems. Firstly, the use of the Jacobi form for an 
elliptic curve means that the time/power required to perform a point addition 
will be almost identical to that of a point doubling. Such a balanced approach is 
a well known design technique for defeating side channel analysis, and this is the 
first time a truly balanced technique has been proposed for use in elliptic curve 
systems. Secondly, the use of a randomised window method creates another level 
of defence. 

In addition our Jacobi form representation can be made more efficient than 
the standard Weierstrass representation for implementations where SPA and 
DPA are not a concern. 
Abstract. Side-channel attacks are a recent class of attacks that have
been revealed to be very powerful in practice. By measuring some side- 
channel information (running time, power consumption, . . . ), an attacker 
is able to recover some secret data from a carelessly implemented crypto- 
algorithm. This paper investigates the Hessian parameterization of an 
elliptic curve as a step towards resistance against such attacks in the 
context of elliptic curve cryptography. The idea is to use the same proce- 
dure to compute the addition, the doubling or the subtraction of points. 
As a result, this gives a 33% performance improvement as compared to 
the best reported methods and requires much less memory. 

Keywords. Elliptic curves, Cryptography, Side-channel attacks,
Implementation, Smart-cards.



1 Introduction 


The rest of this paper is organized as follows. The next section introduces 
the theory of elliptic curves and reviews the related work for computing the 
multiple of a point on an elliptic curve. Section 3 presents the Hessian parame- 
terization of an elliptic curve. It also proves some useful results on this special 
parameterization. The side-channel attacks are defined in Section 4 and some 
countermeasures are discussed. Finally, Section 5 shows how the Hessian
parameterization helps to efficiently foil such attacks in the context of elliptic
curve cryptography.



2 Elliptic Curve Multiplication 

To ease the exposition we assume throughout this paper that K is a field of 
characteristic p > 3. 



2.1 Basic Facts 



We start with a short introduction to elliptic curves. 

Definition 1. Up to a birational equivalence, an elliptic curve over a field $K$ is
a plane nonsingular cubic curve with a $K$-rational point.

Elliptic curves are often expressed in terms of Weierstraß equations:

$$E_{/K} : y^2 = x^3 + ax + b \quad (\text{with } 4a^3 + 27b^2 \ne 0) \tag{1}$$

where $a$ and $b \in K$. The condition $4a^3 + 27b^2 \ne 0$ ensures that the
discriminant

$$\Delta = -16(4a^3 + 27b^2) \tag{2}$$

is nonzero, or equivalently that the points $(x, y)$ on the curve are nonsingular.

More importantly, together with the point at infinity $O$, the points of an
elliptic curve form an Abelian group (with identity element $O$) under the
chord-and-tangent rule defined as follows. If $P = (x_1, y_1)$, then its inverse is
given by $-P = (x_1, -y_1)$. The sum of two points $P = (x_1, y_1)$ and
$Q = (x_2, y_2)$ (with $Q \ne -P$) is equal to $R = (x_3, y_3)$ where

$$x_3 = \lambda^2 - x_1 - x_2 \quad\text{and}\quad y_3 = \lambda(x_1 - x_3) - y_1$$

with

$$\lambda = \begin{cases} \dfrac{3x_1^2 + a}{2y_1} & \text{if } x_1 = x_2, \\[1ex] \dfrac{y_1 - y_2}{x_1 - x_2} & \text{otherwise}. \end{cases}$$

The previous formulae require 2 or 3 multiplications and 1 inversion to add
two points. Since this latter operation is costly (an inversion roughly takes the
same amount of time as 23 multiplications [8]), projective representations of
Weierstraß equations may be preferred.
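
The chord-and-tangent rule translates into a few lines of Python. The sketch below is illustrative only: the function name, the use of `None` for the point at infinity, and the toy curve in the usage note are assumptions. It relies on `pow(x, -1, p)` for modular inversion, which requires Python 3.8 or later.

```python
def ec_add(P, Q, a, p):
    """Add two affine points on y^2 = x^3 + a x + b over F_p (p > 3).

    None represents the point at infinity O. The coefficient b is not
    needed by the addition formulae themselves.
    """
    if P is None:
        return Q
    if Q is None:
        return P
    x1, y1 = P
    x2, y2 = Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                   # Q = -P
    if x1 == x2:
        # Tangent case (P = Q): one inversion and three multiplications.
        lam = (3 * x1 * x1 + a) * pow(2 * y1 % p, -1, p) % p
    else:
        # Chord case: one inversion and two multiplications.
        lam = (y1 - y2) * pow((x1 - x2) % p, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return (x3, y3)
```

For instance, on the toy curve $y^2 = x^3 + 2x + 3$ over $\mathbb{F}_{97}$ the point $(3, 6)$ doubles to $(80, 10)$. The branch on `x1 == x2` is exactly the kind of visible difference between doubling and addition that the side-channel discussion later in the paper is concerned with.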



3 Hessian Curves 

In this section, we formally define the Hessian elliptic curves [10] (see also [2,
p. 36] and [15]) and give some results on this special parameterization.

Definition 2. A Hessian elliptic curve over $K$ is a plane cubic curve given by
an equation of the form

$$E_D : u^3 + v^3 + 1 = 3Duv , \tag{3}$$

or in projective coordinates,

$$E_D : U^3 + V^3 + W^3 = 3DUVW \tag{4}$$

where $D \in K$ and $D^3 \ne 1$.

As shown in the next lemma, the condition $D^3 \ne 1$ imposes that the curve
is nonsingular, that is, elliptic.

Lemma 1. A Hessian cubic curve $E_D(K)$ is singular if and only if $D^3 = 1$.

Proof. Let $P = (U_1 : V_1 : W_1)$ be a singular point. Then
$U_1^2 - DV_1W_1 = V_1^2 - DU_1W_1 = W_1^2 - DU_1V_1 = 0$, hence
$U_1^3 = V_1^3 = W_1^3$ $(\ne 0)$. Therefore there exist $k \in K^*$ and
$r, s, t \in \mathbb{Z}_3$ such that $U_1 = k\omega^r$, $V_1 = k\omega^s$ and
$W_1 = k\omega^t$ where $\omega$ is a non-trivial cubic root of unity. Together with
Eq. (4), this yields $3k^3 = 3Dk^3\omega^{r+s+t}$ or equivalently, $D^3 = 1$. □

Proposition 1. The Hessian curve given by Eq. (3) is birationally equivalent
to the Weierstraß equation

$$y^2 = x^3 - 27D(D^3 + 8)x + 54(D^6 - 20D^3 - 8) , \tag{5}$$

under the transformations

$$(u, v) = \bigl(\eta(x + 9D^2),\; -1 + \eta(3D^3 - Dx - 12)\bigr) \tag{6}$$

and

$$(x, y) = \bigl(-9D^2 + \xi u,\; 3\xi(v - 1)\bigr) \tag{7}$$

where

$$\eta = \frac{6(D^3 - 1)(y + 9D^3 - 3Dx - 36)}{(x + 9D^2)^3 + (3D^3 - Dx - 12)^3} \quad\text{and}\quad \xi = \frac{12(D^3 - 1)}{Du + v + 1} .$$

Proof. Sending the point $P_0 = (0, -1)$ to the origin via the map $v \mapsto v - 1$,
Eq. (3) becomes

$$\sum_{i=1}^{3} c_i(u, v) = 0 \tag{*}$$

where $c_3(u,v) = u^3 + v^3$, $c_2(u,v) = -3v(Du + v)$ and $c_1(u,v) = 3(Du + v)$.
The slope $\lambda$ of the tangent at $P_0$ is equal to $-D$. Letting
$d(u,v) = c_2(u,v)^2 - 4c_1(u,v)c_3(u,v)$, we have
$d(u, \lambda u + 1) = 12(D^3 - 1)u^3 - 27D^2u^2 + 18Du - 3$. Hence,
by Nagell reduction (see Theorem 7.4.9 in [5, p. 393]) and letting
$B = 12(D^3 - 1)$, Eq. (*) is birationally equivalent to
$y^2 = x^3 - 27D^2x^2 + 18DBx - 3B^2$ under the transformations

$$(u, v) = \left( \frac{x\,(By - c_2(x, \lambda x + B))}{2c_3(x, \lambda x + B)},\; \frac{(\lambda x + B)(By - c_2(x, \lambda x + B))}{2c_3(x, \lambda x + B)} \right) = \left( \frac{Bx(y + 3B - 3Dx)}{2(x^3 + (B - Dx)^3)},\; \frac{B(B - Dx)(y + 3B - 3Dx)}{2(x^3 + (B - Dx)^3)} \right)$$

and, noting from Eq. (*) that $2c_3(u,v) + c_2(u,v) = -2c_1(u,v) - c_2(u,v)$,

$$(x, y) = \left( \frac{Bu}{v - \lambda u},\; \frac{B(2c_3(u,v) + c_2(u,v))}{(v - \lambda u)^2} \right) = \left( \frac{12(D^3 - 1)u}{Du + v},\; \frac{36(D^3 - 1)(v - 2)}{Du + v} \right) .$$

Replacing now $x$ by $x + 9D^2$ we finally obtain the required equation and the
corresponding transformations. □
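
Transformations (6) and (7) can be checked by brute force over a small field. The sketch below is illustrative only: the toy prime `P_MOD`, the value of `D` and the function names are assumptions made for this example. Points at which a denominator vanishes are mapped to `None` and skipped.

```python
P_MOD = 101   # toy prime (assumption)
D = 2         # Hessian parameter with D^3 != 1 (mod 101)


def inv(x, p=P_MOD):
    """Modular inverse (Python 3.8+)."""
    return pow(x % p, -1, p)


def on_hessian(u, v, p=P_MOD, d=D):
    """Eq. (3): u^3 + v^3 + 1 = 3 D u v."""
    return (u**3 + v**3 + 1 - 3 * d * u * v) % p == 0


def on_weierstrass(x, y, p=P_MOD, d=D):
    """Eq. (5): y^2 = x^3 - 27 D (D^3 + 8) x + 54 (D^6 - 20 D^3 - 8)."""
    rhs = x**3 - 27 * d * (d**3 + 8) * x + 54 * (d**6 - 20 * d**3 - 8)
    return (y * y - rhs) % p == 0


def hessian_to_weierstrass(u, v, p=P_MOD, d=D):
    """Transformation (7); None when D u + v + 1 vanishes."""
    den = (d * u + v + 1) % p
    if den == 0:
        return None
    xi = 12 * (d**3 - 1) * inv(den, p) % p
    return ((-9 * d * d + xi * u) % p, 3 * xi * (v - 1) % p)


def weierstrass_to_hessian(x, y, p=P_MOD, d=D):
    """Transformation (6); None when the denominator of eta vanishes."""
    den = ((x + 9 * d * d) ** 3 + (3 * d**3 - d * x - 12) ** 3) % p
    if den == 0:
        return None
    eta = 6 * (d**3 - 1) * (y + 9 * d**3 - 3 * d * x - 36) * inv(den, p) % p
    return (eta * (x + 9 * d * d) % p,
            (-1 + eta * (3 * d**3 - d * x - 12)) % p)
```

Enumerating all affine solutions of Eq. (3) and applying `hessian_to_weierstrass` gives points satisfying Eq. (5); applying `weierstrass_to_hessian` afterwards returns the starting point wherever it is defined.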

A 'straightforward' application of the chord-and-tangent rule yields rather
cumbersome formulae for the doubling and the addition on a Hessian curve.
The correct way is to use the Cauchy-Desboves's formulae (see Appendix A),
which exploit the symmetry of Eq. (4). Plugging $W = 0$ into Eq. (4), we get the
point at infinity $O = (1 : -1 : 0)$. The inverse of $O$ is $O$. For $P \ne O$ we can
work in affine coordinates. Let $P = (u_1, v_1)$ be a point on the curve. The line
$v = -u + (u_1 + v_1)$ contains the point $P$ and, considering its projective version
$V = -U + (u_1 + v_1)W$, it also contains the point at infinity $O$. Therefore, $-P$ is
the third point of intersection of this line connecting $P$ and $O$ with the curve.
Substituting $v = -u + (u_1 + v_1)$ into Eq. (3), we obtain

$$u^3 + (-u + (u_1 + v_1))^3 + 1 = 3Du(-u + (u_1 + v_1))$$

$$\iff 3(u_1 + v_1 + D)u^2 - 3(u_1 + v_1)(u_1 + v_1 + D)u + (u_1 + v_1)^3 + 1 = 0$$

$$\iff u^2 - (u_1 + v_1)u + u_1 v_1 = 0 .$$

(Note that $u_1 + v_1 + D \ne 0$ because $D^3 \ne 1$.) So, the $u$-coordinate of $-P$
is $v_1$, and hence its $v$-coordinate is $u_1$, i.e., $-P = (v_1, u_1)$ or, in
projective coordinates,

$$-P = (V_1 : U_1 : W_1) . \tag{8}$$

We use the same notations as in Appendix A. The tangent at
$P = (U_1 : V_1 : W_1)$ intersects the curve at the third point $-2P$ whose
coordinates are given by Eq. (15), with $F(U,V,W) = U^3 + V^3 + W^3 - 3DUVW$.
We have $\varphi = 3(U_1^2 - DV_1W_1)$, $\chi = 3(V_1^2 - DU_1W_1)$ and
$\psi = 3(W_1^2 - DU_1V_1)$ and so,
$-2P = \bigl((\psi^3 - \chi^3)/U_1^2 : (-\psi^3 + \varphi^3)/V_1^2 : (\chi^3 - \varphi^3)/W_1^2\bigr)$.
A short calculation gives

$$\psi^3 - \chi^3 = 27(W_1^2 - DU_1V_1)^3 - 27(V_1^2 - DU_1W_1)^3$$

$$= 27\bigl[(W_1^6 - V_1^6) + D^3U_1^3(W_1^3 - V_1^3) - 3DU_1V_1W_1(W_1^3 - V_1^3)\bigr]$$

$$= 27(W_1^3 - V_1^3)(D^3 - 1)U_1^3 ,$$

and, by symmetry, $\varphi^3 - \psi^3 = 27(U_1^3 - W_1^3)(D^3 - 1)V_1^3$ and
$\chi^3 - \varphi^3 = 27(V_1^3 - U_1^3)(D^3 - 1)W_1^3$. Hence, with Eq. (8), we
finally obtain

$$2P = \bigl(V_1(U_1^3 - W_1^3) : U_1(W_1^3 - V_1^3) : W_1(V_1^3 - U_1^3)\bigr) . \tag{9}$$

From Eq. (16) (in Appendix A), the line connecting the points
$P = (U_1 : V_1 : W_1)$ and $Q = (U_2 : V_2 : W_2)$ intersects the curve at the third
point $-(P + Q) = (U_1\Theta - U_2\Upsilon : V_1\Theta - V_2\Upsilon : W_1\Theta - W_2\Upsilon)$,
where $\Theta = 3U_1(U_2^2 - DV_2W_2) + 3V_1(V_2^2 - DU_2W_2) + 3W_1(W_2^2 - DU_2V_2)$
and $\Upsilon = 3U_2(U_1^2 - DV_1W_1) + 3V_2(V_1^2 - DU_1W_1) + 3W_2(W_1^2 - DU_1V_1)$.
We have

$$U_1\Theta - U_2\Upsilon = 3V_1V_2(U_1V_2 - U_2V_1) + 3W_1W_2(U_1W_2 - U_2W_1) - 3D(U_1^2V_2W_2 - U_2^2V_1W_1) ,$$

$$V_1\Theta - V_2\Upsilon = 3U_1U_2(U_2V_1 - U_1V_2) + 3W_1W_2(V_1W_2 - V_2W_1) - 3D(V_1^2U_2W_2 - V_2^2U_1W_1) ,$$

$$W_1\Theta - W_2\Upsilon = 3U_1U_2(U_2W_1 - U_1W_2) + 3V_1V_2(V_2W_1 - V_1W_2) - 3D(W_1^2U_2V_2 - W_2^2U_1V_1) ,$$

and thus, exploiting the fact that $P$ and $Q$ belong to the curve [9, no. 12], we
obtain

$$\frac{U_1\Theta - U_2\Upsilon}{W_1\Theta - W_2\Upsilon} = \frac{U_1^2V_2W_2 - U_2^2V_1W_1}{W_1^2U_2V_2 - W_2^2U_1V_1} , \qquad \frac{V_1\Theta - V_2\Upsilon}{W_1\Theta - W_2\Upsilon} = \frac{V_1^2U_2W_2 - V_2^2U_1W_1}{W_1^2U_2V_2 - W_2^2U_1V_1} .$$

Therefore, with Eq. (8), the sum $R = P + Q$ is given by

$$R = \bigl(V_1^2U_2W_2 - V_2^2U_1W_1 : U_1^2V_2W_2 - U_2^2V_1W_1 : W_1^2U_2V_2 - W_2^2U_1V_1\bigr) . \tag{10}$$

We now study the points of order 2 and 3. We work in affine coordinates
since we are looking at points $P \ne O$ such that $2P = O$ or $3P = O$. Let
$P = (u_1, v_1)$. The condition $2P = O$ is equivalent to $P = -P$. Therefore, since
$-P = (v_1, u_1)$, the points $P = (u_1, v_1)$ of order 2 are those for which
$u_1 = v_1$.

Suppose $P = (u_1, v_1)$ with $u_1 \ne v_1$, that is, $P, 2P \ne O$. To find the points
$P$ of order 3, we use the doubling formula: $3P = O \iff 2P = -P$. So, a little
algebra shows that the points of order 3 are exactly those with $u_1 = 0$ or
$v_1 = 0$. In particular, the points $(0, -1)$ and $(-1, 0)$ have order 3.

Finally, it is interesting to note that a generic point $P = (U : V : W)$ on the
Hessian curve (4) satisfies

$$(D^2 + D + 1)(U + V + W)^3 = 3(DU + V + W)(U + DV + W)(U + V + DW) , \tag{11}$$

since $(D^2 + D + 1)(U + V + W)^3 - 3(DU + V + W)(U + DV + W)(U + V + DW) =
(D - 1)^2(U^3 + V^3 + W^3 - 3DUVW) = 0$. Moreover, since
$(DU + V + W) + (U + DV + W) + (U + V + DW) = (D + 2)(U + V + W)$, it follows that

$$(\tilde U + \tilde V + \tilde W)^3 = 3\tilde D \tilde U \tilde V \tilde W , \tag{12}$$

where

$$\tilde U = DU + V + W, \quad \tilde V = U + DV + W, \quad \tilde W = U + V + DW \quad\text{and}\quad \tilde D = \frac{(D + 2)^3}{D^2 + D + 1} .$$
4 Side-Channel Attacks

At CRYPTO '96 and subsequently at CRYPTO '99, Kocher et al. introduced a
new class of attacks, the so-called side-channel attacks. By measuring some
side-channel information (e.g., timing [11], power consumption [12]), they were
able to find the secret keys from tamper-resistant devices.

When only a single measurement is performed the attack is referred to as a
simple side-channel attack, and when there are several correlated measurements
it is sometimes referred to as a differential side-channel attack. The main
concern at the moment for public-key cryptography is the simple side-channel
attack [12]. Efficient countermeasures are known for exponentiation-based
cryptosystems (e.g., [4]), but they require the atomic operations to be
indistinguishable. For elliptic curve cryptography, the atomic operations are
addition, subtraction and doubling of points. Within the Weierstraß model, as
suggested in [1], these operations appear to be different and some secret
information may therefore leak through side-channel analysis.

The next section shows that the Hessian parameterization allows one to im- 
plement the same algorithm for the addition (or subtraction) of two points or 
for the doubling of a point. 



5 Implementing the Hessian Curves 

Figure 1 gives a detailed implementation to add two (different) points on an 
Hessian curve. The algorithm requires 12 multiplications (or 10 multiplications 
if one point has its last coordinate equal to 1) and 7 temporary variables. 



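Input : $P = (U_1 : V_1 : W_1)$ and $Q = (U_2 : V_2 : W_2)$ with $P \ne Q$

Output : $P + Q = (U_3 : V_3 : W_3)$

$T_1 \leftarrow U_1$ ; $T_2 \leftarrow V_1$ ; $T_3 \leftarrow W_1$ ; $T_4 \leftarrow U_2$ ; $T_5 \leftarrow V_2$ ; $T_6 \leftarrow W_2$
$T_7 \leftarrow T_1 \cdot T_6$   $(= U_1W_2)$
$T_1 \leftarrow T_1 \cdot T_5$   $(= U_1V_2)$
$T_5 \leftarrow T_3 \cdot T_5$   $(= W_1V_2)$
$T_3 \leftarrow T_3 \cdot T_4$   $(= W_1U_2)$
$T_4 \leftarrow T_2 \cdot T_4$   $(= V_1U_2)$
$T_2 \leftarrow T_2 \cdot T_6$   $(= V_1W_2)$
$T_6 \leftarrow T_2 \cdot T_7$   $(= U_1V_1W_2^2)$
$T_2 \leftarrow T_2 \cdot T_4$   $(= V_1^2U_2W_2)$
$T_4 \leftarrow T_3 \cdot T_4$   $(= V_1W_1U_2^2)$
$T_3 \leftarrow T_3 \cdot T_5$   $(= W_1^2U_2V_2)$
$T_5 \leftarrow T_1 \cdot T_5$   $(= U_1W_1V_2^2)$
$T_1 \leftarrow T_1 \cdot T_7$   $(= U_1^2V_2W_2)$
$T_1 \leftarrow T_1 - T_4$ ; $T_2 \leftarrow T_2 - T_5$ ; $T_3 \leftarrow T_3 - T_6$
$U_3 \leftarrow T_2$ ; $V_3 \leftarrow T_1$ ; $W_3 \leftarrow T_3$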



Fig. 1. AddHesse(P, Q): Addition algorithm on an Hessian curve. 
We note that there are variants for this implementation. For instance, we
are able to describe similar implementations with only 4 auxiliary variables and
18 multiplications, 5 auxiliary variables and 16 multiplications, and 6 auxiliary
variables and 14 multiplications.

More remarkably, owing to the high symmetry of the Hessian parameterization,
the same algorithm can be used for doubling a point. We have:

Proposition 2. Let $P = (U_1 : V_1 : W_1)$ be a point on a Hessian elliptic curve
$E_D(K)$. Then

$$2(U_1 : V_1 : W_1) = (W_1 : U_1 : V_1) + (V_1 : W_1 : U_1) . \tag{13}$$

Furthermore, we have $(W_1 : U_1 : V_1) \ne (V_1 : W_1 : U_1)$.

Proof. Addition formula (10) yields
$(W_1 : U_1 : V_1) + (V_1 : W_1 : U_1) = (U_1^2V_1U_1 - W_1^2W_1V_1 : W_1^2W_1U_1 - V_1^2U_1V_1 : V_1^2V_1W_1 - U_1^2W_1U_1)
= (V_1(U_1^3 - W_1^3) : U_1(W_1^3 - V_1^3) : W_1(V_1^3 - U_1^3)) = 2(U_1 : V_1 : W_1)$
by Eq. (9).

The second part of the proposition follows by contradiction. Suppose that
$(W_1 : U_1 : V_1) = (V_1 : W_1 : U_1)$, i.e., that there exists some $t \in K^*$ s.t.
$W_1 = tV_1$, $U_1 = tW_1$ and $V_1 = tU_1$. This implies $W_1 \ne 0$ and $t^3 = 1$.
Moreover, since $(U_1 : V_1 : W_1) \in E_D(K)$,
$U_1^3 + V_1^3 + W_1^3 = 3DU_1V_1W_1$, which in turn implies
$(t^3 + t^6 + 1)W_1^3 = 3Dt^3W_1^3$ and thus $D = 1$, a contradiction by Lemma 1. □

In [13], Liardet and Smart suggest representing elliptic curves as the
intersection of two quadrics in $\mathbb{P}^3$ as a means to protect against
side-channel attacks. Considering the special case of an elliptic curve whose
order is divisible by 4 (i.e., the Jacobi form), they observe that the same
algorithm can be used for adding and doubling points with 16 multiplications
(see also [3] for the formulae). Using the proposed Hessian parameterization,
only 12 multiplications are necessary for adding or doubling points. The Hessian
parameterization thus gives a 33% improvement over the Jacobi
parameterization. Another advantage of the Hessian parameterization is that
points are represented with fewer coordinates, which results in substantial
memory savings.

Finally, contrary to other parameterizations, there is no (field) subtraction to
compute the inverse of a point (see Eq. (8)). Hence, our addition algorithm can be
used as is for subtracting two points $P = (U_1 : V_1 : W_1)$ and
$Q = (U_2 : V_2 : W_2)$ on a Hessian elliptic curve:

$$(U_1 : V_1 : W_1) - (U_2 : V_2 : W_2) = (U_1 : V_1 : W_1) + (V_2 : U_2 : W_2) . \tag{14}$$

To sum up, by adapting the order of the inputs according to Eq. (13) or
(14), the addition algorithm presented in Fig. 1 can be used indifferently for

— adding two (different) points; 

— doubling a point; 

— subtracting two points;

with only 12 multiplications and 7 auxiliary variables including the 3 result 
variables. This results in the fastest known method for implementing the elliptic 
curve scalar multiplication towards resistance against side-channel attacks. 
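
The register schedule of Fig. 1, together with the input reorderings of Eq. (13) and Eq. (14), can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the toy prime `P_MOD`, the parameter `D` and the function names are chosen for this example only, and a Python function does not by itself give any side-channel guarantee.

```python
P_MOD = 101   # toy prime (assumption)
D = 2         # Hessian parameter with D^3 != 1 (mod 101)


def on_curve(P, p=P_MOD, d=D):
    """Eq. (4): U^3 + V^3 + W^3 = 3 D U V W."""
    U, V, W = P
    return (U**3 + V**3 + W**3 - 3 * d * U * V * W) % p == 0


def add_hesse(P, Q, p=P_MOD):
    """Fig. 1: twelve multiplications and seven temporaries t1..t7."""
    t1, t2, t3 = P
    t4, t5, t6 = Q
    t7 = t1 * t6 % p   # U1 W2
    t1 = t1 * t5 % p   # U1 V2
    t5 = t3 * t5 % p   # W1 V2
    t3 = t3 * t4 % p   # W1 U2
    t4 = t2 * t4 % p   # V1 U2
    t2 = t2 * t6 % p   # V1 W2
    t6 = t2 * t7 % p   # U1 V1 W2^2
    t2 = t2 * t4 % p   # V1^2 U2 W2
    t4 = t3 * t4 % p   # V1 W1 U2^2
    t3 = t3 * t5 % p   # W1^2 U2 V2
    t5 = t1 * t5 % p   # U1 W1 V2^2
    t1 = t1 * t7 % p   # U1^2 V2 W2
    t1 = (t1 - t4) % p
    t2 = (t2 - t5) % p
    t3 = (t3 - t6) % p
    return (t2, t1, t3)   # (U3, V3, W3)


def double_hesse(P, p=P_MOD):
    """Eq. (13): 2(U : V : W) = (W : U : V) + (V : W : U)."""
    U, V, W = P
    return add_hesse((W, U, V), (V, W, U), p)


def sub_hesse(P, Q, p=P_MOD):
    """Eq. (14): P - Q = P + (V2 : U2 : W2)."""
    U2, V2, W2 = Q
    return add_hesse(P, (V2, U2, W2), p)


def proj_eq(A, B, p=P_MOD):
    """Projective equality: both non-zero and all 2x2 minors vanish."""
    return any(A) and any(B) and all(
        (A[i] * B[j] - A[j] * B[i]) % p == 0
        for i in range(3) for j in range(3))
```

All three operations go through the same twelve multiplications in `add_hesse`; only the order of the input coordinates changes. On the toy curve, `double_hesse` reproduces Eq. (9) exactly, and `sub_hesse(add_hesse(P, Q), Q)` returns `P` as a projective point whenever the intermediate results are non-zero.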
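A Cauchy-Desboves's Formulae

Let $F(U, V, W) = 0$ be the (homogeneous) equation of a general cubic curve and
let $P_1 = (U_1 : V_1 : W_1)$ and $P_2 = (U_2 : V_2 : W_2)$ be two points on the curve.

We let $\varphi = \frac{\partial F(P_1)}{\partial U}$, $\chi = \frac{\partial F(P_1)}{\partial V}$
and $\psi = \frac{\partial F(P_1)}{\partial W}$. Then the tangent
at $P_1$ intersects the curve at the third point [9, Eq. (16)] given by

$$\left( \frac{F(0, \psi, -\chi)}{U_1^2} : \frac{F(-\psi, 0, \varphi)}{V_1^2} : \frac{F(\chi, -\varphi, 0)}{W_1^2} \right) . \tag{15}$$

Moreover, the secant joining $P_1$ and $P_2$ intersects the curve at the third
point [9, Eq. (16)] given by

$$(U_1\Theta - U_2\Upsilon : V_1\Theta - V_2\Upsilon : W_1\Theta - W_2\Upsilon) , \tag{16}$$

where
$\Theta = U_1 \frac{\partial F(P_2)}{\partial U} + V_1 \frac{\partial F(P_2)}{\partial V} + W_1 \frac{\partial F(P_2)}{\partial W}$
and
$\Upsilon = U_2 \frac{\partial F(P_1)}{\partial U} + V_2 \frac{\partial F(P_1)}{\partial V} + W_2 \frac{\partial F(P_1)}{\partial W}$.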



B Samples 

Here are two examples of cryptographic Hessian elliptic curves $E_D(\mathbb{F}_p)$ defined
over the prime field Fp with p = — 2933 and p = 2^^^ — 2^° — 1, respectively. 

Both curves are adapted from [6] using Proposition 1. Note that since $(0, -1)$
is on the curve whatever the values of $D$ and $p$ and that this point has order
3, the order of a Hessian curve, $\#E_D(\mathbb{F}_p)$, is always a multiple of 3.
Note also that this specialized representation does not impact the security of
the resulting cryptographic applications.



B.l 160-Bit Prime 

p = - 2933 

D = 945639186043697550302587435415597619883075636292 

$\#E_D(\mathbb{F}_p)$ = 3 • 5 • 157 • 620595175087432237029165529381611169224913337

B.2 224-Bit Prime 

p = 2^2^ - 2^° - 1 

D = 25840187014857916932759133078916563544400020237401312879815735566345 
$\#E_D(\mathbb{F}_p)$ = 3 • 23 • 39072386474131362021256543604376277771282351667\
3432244734573782061